
Graph Gist Winter Challenge Winners


To be honest, we were blown away.

 
When starting this challenge we were really excited and curious about the results. But what YOU created and submitted is just impressive.
 
We received 65 submissions in the 10+ categories. Well done!
 
Make sure to check them out; each one is a jewel on its own, and there are many surprises hidden in these submissions. And if you get started with Neo4j in one of these domains, you might already have your modeling and use-case work halfway done. So before starting a proof-of-concept project, have a look.
 
You can certainly imagine it was really hard for us to choose the winners, given the sheer volume and quality. The quality of the submissions is truly astonishing, and we hope it will get even better once the feedback from the comment sections is taken into account.
 
Everyone who participated will receive a Neo4j T-shirt (if a postal address and size were submitted here), and the winners will get Amazon gift certificates (300, 150 and 50 USD).
 
But without further ado let’s look at the categories and the winners:

Education

  1. Organization Learning by @luannem – covering your path through courses and certifications in a learning management system.
  2. Degrees offered by the University of Oviedo by @leyvanegri – solving use-cases for students at a university.
  3. Interpreting Citation Patterns in Academic Publications: A research aid by Jonatan Jäderberg – an advanced use of graphs to connect scientific papers.

Finance

  1. Graphing our way through the ICIJ offshore jurisdiction data by @hermansm – an impressive investigative tracking of leaked data sets about (legal) company activities.
  2. Finance and Asset Management by @rushugroup is an interesting set of financial portfolio analytics use-cases.
  3. Options Trading As A Graph by @lyonwj looks at how to model the tricky business of option trading in a graphy way.

Life Science

  1. Medicine & drugs classification for the Central Hospital of Asturias by @Roqueeeeee and @luigi9215 is an impressive representation of drug-related use-cases for a hospital.
  2. Competitive Intelligence in Cancer Drug Discovery by @livedataconcept cleanly models and queries available cancer drugs.
  3. DoctorFinder! by @fbiville & the VIDAL team is a real-life application showing how to find the drugs and doctors for your symptoms.

Manufacturing

  1. Project Management by @_nicolemargaret shows how graphs are perfect for dependency management in an incremental fashion.
  2. Car Manufacturers 2013 by @fernanvic1 explores the intricate network of car manufacturers, their brands, investments and models.
  3. Device manufacture trends by @shantaramw lets you glimpse how graphs can also be exploited for business intelligence use-cases.

Sports

  1. Alpine Skiing seasons by @pac_19 uses an intricate model to map the real FIS data into the graph to find some really cool insights.
  2. F1 2012/2013 Season by @el_astur answers many different questions by looking at Formula One racing data.
  3. League of Legends eSports – LCS by @SurrealAnalysis looks at different analytical statistics of the League Championship Series.

Resources

  1. EPublishing: A graphical approach to digital publications by @deepeshk79 impressively covers a lot of different use-cases in the publication domain and workflow.
  2. Piping Water by @shaundaley1 looks at London’s pipe system and how that natural graph could be managed by using a graph database.
  3. QLAMRE: Quick Look at Mainstream Renewables Energies by @Sergio_Gijon is a quick look at categorizations of renewable energies.
 
The Antarctic Research: The Effect of Funding & Social Connections in the US Antarctic Program by @openantarctica is really impressive but sadly not eligible, as the demo dataset used is too large for the limited scope.

Retail

  1. Food Recommendation by @gromajus uses a graph model of food, ingredients and recipes to compute recommendations, taking preferences and allergies into account.
  2. Single Malt Scotch Whisky by @patbaumgartner is my personal favorite – you certainly know why :) Ardbeg 17 is the best.
  3. Phone store by @xun91 uses phone models, attributes, manufacturers and stock information to make recommendations for customers.

Telecommunication

  1. Amazon Web Services Global Infrastructure Graph by @AIDANJCASEY represents all regions, zones, services and instance types as a graph – awesome for just browsing or for finding the best or cheapest offering.
  2. Geoptima Event Log Collection Data Management by @craigtaverner is a really involved but real world model of mobile network event and device data tracking.
  3. Mobile Operators in India by @rushugroup is a basic graph gist exploring the Indian phone network by device technology and operators.
 

Transport

Transport and routing is a great domain for graphs, and we see a lot of potential here. Unfortunately, the sandbox is not well suited for some of the large demo datasets, so some of the entries did not qualify.
 
  1. Roads, Nodes and Automobiles by @tekiegirl shows how user-provided road maps could be represented in a graph and what you can do with them. There are great example queries for the M3 and M25 motorways in the UK.
  2. Bombay Railway Routes by @luannem shows advanced routing queries for the infamous railway network.
  3. Trekking and Mountaineering routing by @shantaramw – Himalayan routes in a graph are not just for hard-core trekkers and bikers, and the queries deliver useful answers.

Advanced Graph Gists

As expected this has been most impressive, people really went far and wide to show what’s possible with graphs and graph-gists. Really hard to choose in this category.
 
  1. Movie Recommendations with k-NN and Cosine Similarity by @_nicolemargaret – Nicole really shows off, computing, storing and using similarities between people for movie ratings.
  2. Skip Lists in Cypher by @wefreema – a graph is a universal data structure, so why not use it for other data structures too? Wes shows how, with a full-blown skip list implementation in Cypher.
  3. Small Social Networking Website by @RaulEstrada – this is not over the top like some others, but a really good and comprehensive example of what graphs are good for.

Other

This category unintentionally sneaked in, but it had some really good submissions, so we also award some prizes here. It's like the little brother of the Advanced category.
 
  1. Embedded Metamodel Subgraphs in the FactMiners Social-Game Ecosystem Part 2 by @Jim_Salmons explores the possibilities of using data and meta-data in the same graph structure and which additional information you can infer about your data.
  2. Legislative System Graph by @yaravind is an impressive collection of use cases on top of electorate data.
  3. User, Functions, Applications, or “Slicing onion with an axe” by @karol_brejna covers resource and permission management of an IT infrastructure.
 
I not only want to thank all of you who contributed, but also our awesome judging team (Mark, Wes, Luanne, Jim, Kenny, Anders and Chris), who spent a lot of time looking at the individual GraphGists and provided valuable feedback in the comment sections. So please, authors, thank them by updating your gists and taking those comments into account!
 
As we want you to always publish your awesome graph models, we’d like you to know:
 
Everyone who, now or in the future, submits a GraphGist on a new topic via this form will get a t-shirt from us.
GraphGists are a great initial graph model for anyone starting with graphs and Neo4j.
That's why we want you to vote on the gists you really like or found helpful.
Thank you!
 
In case you wonder what the "Rules for a Good GraphGist" we used for judging are, here are some of them. If you work on a GraphGist in the future, please keep them in mind:
  • interesting/insightful domain
  • a good number of realistic use-cases with sensible result output
  • description, model picture should be easy to understand
  • sensible dataset size (at most 150 nodes and 300 relationships)
  • good use of the GraphGist tools (table, graph, hide-setup etc.)
  • we had an epiphany while looking at the gist
 
And last but not least, a special treat. The structr team has added GraphGist import to structr, so you can automatically create a schema and import the initial dataset into your graph-based application. Then add some use-case endpoints and you're done.
Michael, for the Neo4j Team

Want to learn more about graph databases? Click below to get your free copy of O'Reilly's Graph Databases ebook and discover how to use graph technologies for your application today. Download My Ebook

The post Graph Gist Winter Challenge Winners appeared first on Neo4j Graph Database.


From the Neo4j Community: Best of May 2014

Our community is awesome! From blog posts to videos to GraphGists, graphistas around the world regularly contribute some graph-tastic stuff. Below is a list of some of our favorites from the past month:

ARTICLES:

Neo4j + Spring Data – a natural fit for my data by Josh Long, Michael Hunger and Matti Tahvonen

Implementing Word Ladder game using Neo4j by Nikhil Kuriakose

Docker Neo4j by Jun Matsushita

Introducing GraphAware Neo4j Framework by Michal Bachman

Network management and impact analysis with Neo4j

Cool Data Viz: Tom Sawyer Software

Neo4j 2.0 and Keylines: Making the most of labels

How facebooks gatekeeper works. Maybe… by Jay Conway

Pathfinding Demystified (Part 1): Introduction by Gabriel Gambetta

Going places faster with a graph database by Rik Van Bruggen on bdaily.co.uk

Data Modeling in Graph Databases: Interview with Jim Webber and Ian Robinson on InfoQ

A CRM with Neo4j and REST by Dan Schaefer

About the Robustness of Neo4j by Axel Morgner

Time-Based Versioned Graphs by Ian Robinson

Fraud: spot the pattern by Philip Rathle and Gorka Sadowski

Neo4j, RDF and Kevin Bacon by Tom Morris

PrimeFaces + Spring Data + Neo4j Integration by Amr Mohammed

Visualizing an xml as a graph – Neo4j 101 by Nikhil Kuriakose

A week of IT and graph of the week by Mike Holdsworth

Can graphs help fight gangs? and Analysing the Offshore Leaks with graphs by Jean Villedieu

GRAPHGISTS:

Network versioning using relationnodes by Tom Zeppenfeldt

SLIDES:

Spreadsheets are graphs too: Using Neo4J as backend to store spreadsheet information by Felienne Hermans

Exploring Election Results with Neo4j by David Simons

Domain vs Data Centric Graphs by Tareq Abedrabbo

TRAININGS:

Up and Running with Neo4j by Duane Nickull

VIDEOS:

Data-Driven Applications with Spring and Neo4j with Michael Hunger and Josh Long

Dem-O Bones by Rik Van Bruggen

Running Neo4j in Production: Tips, Tricks and Optimizations with David Fox from Snap Interactive at NYC Neo4j Meetup

Want to learn more about graph databases? Click below to get your free copy of O’Reilly’s Graph Databases ebook and discover how to use graph technologies for your application today. Download My Ebook

The post From the Neo4j Community: Best of May 2014 appeared first on Neo4j Graph Database.

Let Graph-dom Ring! Four GraphDB Reads for the Fourth of July


Celebrate American independence and freedom from table structures with these four blog posts.

Happy Fourth of July and happy graphDB reading from the Neo4j team!

Graph-y 4th of July

Blog: Hierarchical Pattern Recognition by Kenny Bastani

Blog: Neo4j: Set Based Operations with the experimental Cypher optimiser by Mark Needham

Blog: Scaling Concurrent Writes in Neo4j by Max De Marzi

Blog: Using LoadCSV to Import Data from Google Spreadsheet by Rik Van Bruggen

[BONUS] GraphGist: Recruitment Graph Model by GraphAware

[BONUS] Video: Visualization of a Deep Learning Algorithm for Mining Patterns in Data by Kenny Bastani

The post Let Graph-dom Ring! Four GraphDB Reads for the Fourth of July appeared first on Neo4j Graph Database.

From the Neo4j Community: Best of June 2014

The Neo4j community once again posted tons of graph-tastic stuff this past month, from awesome articles to great GraphGists. Here are a few of our favorites from the Neo4j community in June:

ARTICLES

How Businesses Can Adopt a Google-style Approach to Understand Big Data by Emil Eifrem

How the Graph is Driving Tech City Start-ups by Jim Webber

Aggregate by different functions and join results into one data frame by Mark Needham

Creating a Graph with Cypher by mteasdal

Experimenting with Explaining Neo4j on a Whiteboard by Rik Van Bruggen

Experiments with NEO4J: Using a graph database as a SQL Server metadata hub by David Poole

Extracting Your LinkedIn Connections Into Neo4j Graph Database by Greg Dziemidowicz

Find JPA Entities without Field Access by Aparna Chaudhary

Graph Databases Find Answers for the Sick and their Healers by Joab Jackson

A Graph Database Should Be On Your Technology Radar for the Next Application You Build by Amir Khawaja

Hierarchical Pattern Recognition by Kenny Bastani

Importing CSV Data into Neo4j to Make a Graph by Samantha Zeitlin

Modelling the TOUR DE FRANCE 2014 in a Neo4j Graph Database by Lorenzo Speranzoni

Neo4j Unit Testing with Graph Unit by Aldrin Misquitta and Luanne Misquitta

Neo4j’s Cypher Vs. Clojure — Group By and Sorting by Mark Needham

Rendering a Neo4j Database in UbiGraph by DZone

Scaling Concurrent Writes in Neo4j by Max De Marzi

Set Based Operations with the experimental Cypher optimiser by Mark Needham

Smooth Cypher with eXtended Objects: Neo4j and Cypher-DSL by Lars Martin

Using AsciiArt to Analyse your SourceCode with Neo4j by Michael Hunger

GRAPHGISTS

Device Troubleshooting using a Graph Database by Ravi Pappu

Elite: Dangerous Trading by Rickard Oberg

A GraphGist of GraphGists by May Lim

The Recruitment Graph Model by GraphAware

The Feed is King (or Queen) by Aran Mulholland

VIDEOS

Getting Started with Neo4j, Ruby 2.1.2, and Rails 4.1.1 by Ben Morgan

Visualization of a Deep Learning Algorithm for Mining Patterns in Data by Kenny Bastani

Building Neo4j Backed Web Applications by Axel Morgner

 

The post From the Neo4j Community: Best of June 2014 appeared first on Neo4j Graph Database.

OSCON Twitter Graph


OSCON Twitter Graph

As a part of Neo4j's community engagement around OSCON, we wanted to look at the social media activity of the attendees on Twitter. Working with the Twitter Search API and searching for mentions of "OSCON", we wanted to create a graph of Users, Tweets, Hashtags and shared Links.

OSCON Twitter Graph Model

The Twitter Search API returns a list of tweets matching a supplied search term. We then populated the graph model shown above by representing the results as nodes and relationships, using Neo4j's query language, Cypher. We designed a single Cypher query to import each tweet into the graph model in Neo4j, driven by a single parameter that contains all of the tweets returned from Twitter's Search API. Using the UNWIND clause, we pivot the collection of tweets into a set of rows containing information about each tweet, which can then be structured into the outlined graph model from the image.
UNWIND {tweets} AS t
MERGE (tweet:Tweet {id:t.id})
SET tweet.text = t.text,
tweet.created_at = t.created_at,
tweet.favorites = t.favorite_count
MERGE (user:User {screen_name:t.user.screen_name})
SET user.profile_image_url = t.user.profile_image_url
MERGE (user)-[:POSTS]->(tweet)
FOREACH (h IN t.entities.hashtags |
    MERGE (tag:Hashtag {name:LOWER(h.text)})
    MERGE (tag)-[:TAGS]->(tweet)
)
// … source, mentions, links, retweets, …
We used this Cypher query to continuously poll the Twitter API at a regular interval, expanding our graph with the results of each search. At the time of writing, we have imported the following data:

Label     Count
Tweet     10653
User       4910
Link       1153
Hashtag     742
Source      175

With this, we are able to answer many interesting questions about Twitter users at OSCON. For example, which platform are users tweeting from most often?
MATCH (t:Tweet)-[:USING]->(s:Source)
RETURN s.name as Source, count(t) as Count
ORDER BY Count DESC
LIMIT 5

Source               Count
Twitter Web Client    2294
Twitter for iPhone    1712
Twitter for Android   1590
TweetDeck              877
Hootsuite              668

Which hashtags co-occur with #python most frequently?
MATCH (:Hashtag {name:'python'})-[:TAGS]->(:Tweet)<-[:TAGS]-(h:Hashtag)
WHERE h.name <> 'oscon'
RETURN h.name AS Hashtag, COUNT(*) AS Count
ORDER BY Count DESC
LIMIT 5

Hashtag      Count
java             7
opensource       5
data             5
golang           5
nodejs           5

Which other topics could we recommend to a specific user? We look for the topics that most frequently co-occur with the ones they have used, but that they haven't used themselves.
MATCH (u:User {screen_name:"mojavelinux"})-[:POSTS]->(tweet)
    <-[:TAGS]-(tag1:Hashtag)-[:TAGS]->(tweet2)<-[:TAGS]-(tag2:Hashtag)
WHERE tag1.name <> 'oscon' AND tag2.name <> 'oscon'
AND NOT (u)-[:POSTS]->()<-[:TAGS]-(tag2)
RETURN tag2.name as Topics, count(*) as Count
ORDER BY count(*) DESC LIMIT 5

Topics       Count
graphdb         30
graphviz        24
rstats          21
alchemyjs       21
cassandra       21

Which tweet has been retweeted the most, and who posted it?
MATCH (:Tweet)-[:RETWEETS]->(t:Tweet)
WITH t, COUNT(*) AS Retweets
ORDER BY Retweets DESC
LIMIT 1
MATCH (u:User)-[:POSTS]->(t)
RETURN u.screen_name AS User, t.text AS Tweet, Retweets

User        Tweet                                      Retweets
andypiper   Wise words #oscon http://t.co/f4Jr9hnMcV        470

To test your own queries on this graph model, check out our GraphGist.

Graph Visualization

The interesting aspect of this tweet-graph is that it contains the implicit connections between users via their shared hashtags, mentions and links. This graph differs from the "official" followers graph that Twitter makes explicit. Via the inferred connections we can discover new groups of people or topics we might be interested in. So we wanted to visualize this aspect of our graph on the big screen.

We wrote a tiny Python application that queries Neo4j for connections between people and tags (skipping the tweets in between) and makes the data available to a JavaScript front-end. The query takes the last 2,000 tweets to analyze, follows the paths to tags and mentioned users, and returns 1,000 tuples of users connected to a tag or user, to keep the visualization manageable.
MATCH (t:Tweet)
WITH t ORDER BY t.id DESC LIMIT 2000
MATCH (user:User)-[:POSTS]->(t)<-[:TAGS]-(tag:Hashtag)
MATCH (t)-[:MENTIONS]->(user2:User)  
UNWIND [tag,user2] as other WITH distinct user,other
WHERE lower(other.name) <> 'oscon'  
RETURN { from: {id:id(user),label: head(labels(user)), data: user},
    rel: 'CONNECTS',
    to: {id: id(other), label: head(labels(other)), data: other}} as tuple
LIMIT 1000
The front-end then uses VivaGraphJS, a WebGL-enabled graph rendering library, to render the Twitter activity graph of OSCON attendees. We use the Twitter profile images and hashtag names to visualize the nodes.

Neo4j Twitter Graph Visualization

Want to learn more about graph databases? Click below to get your free copy of O'Reilly's Graph Databases ebook and discover how to use graph technologies for your application today. Download My Ebook

The post OSCON Twitter Graph appeared first on Neo4j Graph Database.

What Can Banks Learn from Online Dating


Neo4j co-founder and GraphConnect speaker discusses the role of graph databases in the future of finance

Originally posted on Wired.com. Written by Emil Eifrem, CEO of Neo Technology.

At first glance, the idea that the banking or finance sector could learn a trick or two from the online dating industry is laughable. After all, while the former is heavily regulated, deeply complex and integral to our economy, the latter is frivolous by comparison. Dating, as is often said, is a numbers game. And organizations such as Match.com, eHarmony and Zoosk rely on very sophisticated technology as they sift through vast customer bases to create the most compatible couples.

Specifically, they rely on data to build the most nuanced portraits of their members that they can, so they can find the best matches. This is a business-critical activity for dating sites – the more successful the matching, the better revenues will be. One of the ways they do this is through graph databases. These differ from relational databases – as conventional business databases are called – in that they specialize in identifying the relationships between multiple data points. This means they can query and display connections between people, preferences and interests very quickly.

Applying Dating Insights to the Financial Sector

So where do financial institutions come in? Dating sites have put graph databases to such effective use because they are very good at modelling social relationships, and it turns out that understanding people's relationships is a far better indicator of a match than a purely statistical analysis of their tastes and interests. The same is also true of financial fraud.

The finance and banking sector loses billions of dollars each year as a result of fraud. While security measures such as the Address Verification Service and online tools such as Verified by Visa do help prevent some losses, fraudsters are becoming increasingly sophisticated in their approach. Over the last few years, "first-party" fraud has become a serious threat to banking – and it is very difficult to detect using standard methods. The fraudsters behave very similarly to legitimate customers, right up until the moment they clear their accounts and disappear.

One of the features of first-party fraud is the exponential relationship between the number of individuals involved and the overall currency value being stolen. For example, 10 fraudsters can create 100 false identities sharing 10 elements between them (name, date of birth, phone number, address etc.). It is easy for a small group of fraudsters to use these elements to invent identities which to banks look utterly genuine. The ability to maximize the "take" by involving more people makes first-party fraud particularly attractive to organized crime. The involvement of networks of individuals actually makes the job of investigation easier, however.

The ‘Social Network’ Analysis

Graph databases allow financial institutions to identify these fraud rings through connected "social network" analysis. This involves exploring and identifying any connections between customers before looking at their spending patterns. These operations are very difficult for conventional bank databases to perform, as the relational database technology they are built on is designed to identify values rather than to explore relationships within the data.

Importantly, taking new insights from the connections between data does not necessarily require gathering new data. Instead, by reframing the issue within a graph database, financial institutions are able to flag advanced fraud scenarios as they are happening, rather than after the fact. It therefore follows that the very same "social graphs" that dating sites use to find matches between people also represent a significant advance in the fight against fraud, where traditional methods fall short.

In the same way that graph databases outperform their relational counterparts in mapping out social networks, they can also be put to work in other contexts – as recommendation engines, supporting complex logistics or business processes, or as customer relationship management tools. From fraud rings and educated criminals operating on their own to lonely hearts searching for love, graph databases provide a unique ability to discover new patterns within hugely complex volumes of data, in real time. Ultimately, in either case they can save businesses time and money and offer a competitive advantage – something that any bank is sure to love.
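To make this concrete, here is a minimal Cypher sketch of the kind of ring-detection query such an analysis might run. Every label, relationship type and property name below is an assumption for illustration; none of it comes from the article:

// Find identifiers (phone numbers, addresses, SSNs) shared by more
// than one account holder – the signature of a first-party fraud ring.
MATCH (a:AccountHolder)-[:HAS_PHONE|:HAS_ADDRESS|:HAS_SSN]->(identifier)
      <-[:HAS_PHONE|:HAS_ADDRESS|:HAS_SSN]-(b:AccountHolder)
WHERE a <> b
RETURN identifier, count(DISTINCT a) AS ringSize
ORDER BY ringSize DESC
LIMIT 10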

GraphConnect 2014

Emil Eifrem is founder of the Neo4j open source graph database project. He will be speaking on the subject at GraphConnect 2014, the world's only conference focused on the topic of graph databases. It will be held on October 22 in San Francisco, and will feature speakers from Neo Technology, eBay, CrunchBase, Elementum, Polyvore, ConocoPhillips and more. Visit GraphConnect.com for more information.

Want to learn more about graph databases? Click below to get your free copy of O'Reilly's Graph Databases ebook and discover how to use graph technologies for your application today. Download My Ebook

The post What Can Banks Learn from Online Dating appeared first on Neo4j Graph Database.

Building a Python Web Application Using Flask and Neo4j

Flask, a popular Python web framework, has many tutorials available online which use an SQL database to store information about the website’s users and their activities.

While SQL is a great tool for storing information such as usernames and passwords, it is not so great at allowing you to find connections among your users for the purposes of enhancing your website’s social experience.

The quickstart Flask tutorial builds a microblog application using SQLite. 

In my tutorial, I walk through an expanded, Neo4j-powered version of this microblog application that uses py2neo, one of Neo4j’s Python drivers, to build social aspects into the application. This includes recommending similar users to the logged-in user, along with displaying similarities between two users when one user visits another user’s profile.

My microblog application consists of Users, Posts, and Tags modeled in Neo4j:

http://i.imgur.com/9Nuvbpz.png


With this graph model, it is easy to ask questions such as:

“What are the top tags of posts that I’ve liked?”

MATCH (me:User)-[:LIKED]->(post:Post)<-[:TAGGED]-(tag:Tag)
WHERE me.username = 'nicole'
RETURN tag.name, COUNT(*) AS count
ORDER BY count DESC

“Which user is most similar to me based on tags we’ve both posted about?”

MATCH (me:User)-[:PUBLISHED]->(:Post)<-[:TAGGED]-(tag:Tag),
      (other:User)-[:PUBLISHED]->(:Post)<-[:TAGGED]-(tag)
WHERE me.username = 'nicole' AND me <> other
WITH other,
     COLLECT(DISTINCT tag.name) AS tags,
     COUNT(DISTINCT tag) AS len
ORDER BY len DESC LIMIT 3
RETURN other.username AS similar_user, tags
Links to the full walkthrough of the application and the complete code are below.

Watch the Webinar:





Want to learn more about graph databases? Click below to get your free copy of O’Reilly’s Graph Databases ebook and discover how to use graph technologies for your application today.

The post Building a Python Web Application Using Flask and Neo4j appeared first on Neo4j Graph Database.

(March Madness) <-[:MADE_SANE_WITH]- (Neo4j)


March GRAPHness


Download all the code needed to try it out for yourself HERE, or check out the GraphGist HERE.

March Madness is a rare concord of well-documented data and pop culture. Warren Buffett's billion-dollar bet grabbed the interest of everyone from Wall St. quants to Silicon Valley engineers to armchair Moneyballers everywhere, and suddenly it paid off to be a big data geek.

It’s All Relative


To me, basketball is all about relationships. There are, of course, teams that are unambiguously better than others; however, there is nearly always some sort of relative performance bias, where a team performs better or worse than their average performance would project due to some confluence of factors – whether it's a team with an infamously brutal crowd of fans, a point guard that dissects your league-leading zone, or a decades-long rivalry that motivates your players to dig just a little deeper.

Performance is relative. These statistics are difficult to track across a single season and often incredibly difficult to track across time.

Secondly, being able to iterate on that model is taxing both in terms of writing the queries and in maintaining any reasonable performance on commodity hardware. I had a mountain of data from the past four seasons, including points scored, location, date, etc. etc. 

We could easily add more granular information or more historic data, but for no particular statistical reason and only because it made my life easier, I decided that in my model these relationships should churn almost entirely every four years (as current players graduate and move on).

Finally, we’re going to build our “win power” relationship between teams as a function of the Pythagorean Expectation model (More on that later).

STEP 1: Idea —> Graph Model


I am not a clever boy. However, I have several clever tools at my disposal, the chief of which is Neo4j. So, I started as I do all of my graphy projects – with the questions I planned to ask most frequently and a whiteboard (or a piece of paper in this case).

Which became…

[Diagram: the March Madness graph model]


Which is a totally reasonable graph model for me to import data against.

STEP 2: Time


Before I loaded any data into Neo4j, I first needed to create the time-tree seen in the above model. One of Neo4j’s brilliant engineers (Thanks Mark!) did the heavy lifting for me and wrote a short Cypher snippet to generate the time-model I needed.

[Screenshot: Cypher query generating the time tree]
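The snippet itself lived in the screenshot, but a minimal time-tree generator in Cypher looks roughly like this – the year range, labels and relationship names are assumptions:

// Build a Year -> Month tree; days are elided for brevity.
WITH range(2011, 2014) AS years
FOREACH (year IN years |
  MERGE (y:Year {value: year})
  FOREACH (month IN range(1, 12) |
    MERGE (y)-[:HAS_MONTH]->(:Month {value: month})
  )
)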


The result is something like this:

[Screenshot: the generated time tree in the Neo4j browser]


STEP 3: my.csv —> graph.db


Neo4j ships with a very powerful ETL tool called “LOAD CSV.” We’re going to use that.

I downloaded a mess of NCAA scores, then surreptitiously converted the data I downloaded from Excel spreadsheets into CSV format. I’ve hosted them in a public Dropbox found in the repo link above.

We’re bringing in several CSV files, each one representing a given season and then sewing that all together based on team names.

[Screenshot: the LOAD CSV import query]
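The import query was also a screenshot; a hedged reconstruction of its general shape, with a made-up URL and column names, would be:

// One CSV per season: each row holds both teams and their scores.
// (Sewing each game into the time tree is elided here.)
LOAD CSV WITH HEADERS FROM 'https://example.com/ncaa_2014_15.csv' AS row
MERGE (home:Team {name: row.home_team})
MERGE (away:Team {name: row.away_team})
CREATE (g:Game {date: row.date})
CREATE (home)-[:PLAYED_IN {points: toInt(row.home_score)}]->(g)
CREATE (away)-[:PLAYED_IN {points: toInt(row.away_score)}]->(g)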


STEP 4: History, Victory and a Little Math


I've decided to create a relationship between each team called :WINPOWER, based on a concept from baseball called Pythagorean Expectation.

:WINPOWER essentially assigns a win probability based on points scored vs. points allowed. I added in a decay factor to weigh more recent games more heavily than those played long ago.

[Screenshot: Cypher creating the :WINPOWER relationships]
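For reference, the classic Pythagorean expectation is points_for² / (points_for² + points_against²). A hedged sketch of deriving :WINPOWER from the import model sketched above – the real query was in the screenshot, and the decay factor is omitted:

// Aggregate head-to-head scoring and store a win expectation.
MATCH (a:Team)-[pa:PLAYED_IN]->(g:Game)<-[pb:PLAYED_IN]-(b:Team)
WITH a, b, sum(pa.points) AS pointsFor, sum(pb.points) AS pointsAgainst
MERGE (a)-[w:WINPOWER]->(b)
SET w.score = pointsFor ^ 2 / (pointsFor ^ 2 + pointsAgainst ^ 2)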


STEP 5: The Big Payout


Who should win between Navy and Michigan St.?

[Screenshot: Cypher query comparing Navy and Michigan St. by winPower]
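A sketch of what that comparison might look like under the model above (not the author's exact query):

MATCH (a:Team {name: 'Michigan St.'})-[w1:WINPOWER]->(b:Team {name: 'Navy'}),
      (b)-[w2:WINPOWER]->(a)
RETURN CASE WHEN w1.score > w2.score THEN a.name ELSE b.name END AS predictedWinner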


We see that our algorithm predicts (correctly!) that Michigan St. will defeat Navy:

[Screenshot: query result showing Michigan St. as the predicted winner]


Well…but what if they’ve never played each other? We can use the other teams they both played in common to determine a winPower:

[Screenshot: Cypher query deriving winPower via common opponents]
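Again as a sketch only, a common-opponents variant can average each team's winPower against the opponents both have faced:

MATCH (a:Team {name: 'Kentucky'})-[w1:WINPOWER]->(common:Team)
      <-[w2:WINPOWER]-(b:Team {name: 'Hampton'})
RETURN avg(w1.score) AS kentuckyPower, avg(w2.score) AS hamptonPower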


We see that Kentucky should (and did) beat Hampton!

[Screenshot: query result showing Kentucky as the predicted winner]


// kvg

Want to learn more about graph databases? Click below to get your free copy of O'Reilly's Graph Databases ebook and discover how to use graph technologies for your application today.

The post (March Madness) <-[:MADE_SANE_WITH]- (Neo4j) appeared first on Neo4j Graph Database.


JCypher: Focus on Your Domain Model, Not How to Map It to the Database [Community Post]

JCypher Allows You to Focus on Your Domain Model Instead of Mapping It to the Database

[As community content, this post reflects the views and opinions of the particular author and does not necessarily reflect the official stance of Neo4j.]

Software developers around the world spend a significant amount of their time struggling with database-related problems instead of concentrating on the implementation of domain models and business logic. The idea of orthogonal persistence, together with the approaches of modern ORMs (Object-Relational Mappers), has eased this pain to some degree, but when it comes to performing queries on connected data, there is still no way around hand-crafted design of database structures, almost on a per-query basis.

Introducing JCypher

JCypher, utilizing the power of Neo4j graph databases, aims to bring that long-given promise one big step closer to reality. This Java open source project (hosted on GitHub) allows you to concentrate on your domain model instead of how to map it to a database, at the same time enabling you to execute powerful queries upon your model with high performance. JCypher provides seamlessly integrated Java access to graph databases (like Neo4j) at different levels of abstraction. Let’s look at those layers from the top down:

Business Domains

At the topmost level of abstraction, JCypher allows you to map complex business domains to graph databases. You can take an arbitrarily complex graph of domain objects or POJOs (plain old Java objects) and store it in a straightforward way into a graph database for later retrieval. You do not need to modify your domain object classes in any way. You do not even need to add annotations. Moreover, JCypher provides a default mapping so you don’t have to write a single line of mapping code or mapping configuration.

Domain Queries

At the same level of abstraction, "Domain Queries" provide the power and expressiveness of queries on a graph database, while being formulated on domain objects or on types of domain objects, respectively. The true power of Domain Queries comes from the fact that the graph of domain objects is backed by a graph database.

Generic Graph Model

At the next lower level of abstraction, access to graph databases is provided based on a generic graph model. While simple, the model allows you to easily navigate and manipulate graphs. The model consists of nodes, relations and paths, together with properties, labels and types.

Native Java Domain-Specific Language

At the bottom level of abstraction, a "native Java DSL" in the form of a fluent Java API allows you to intuitively and comfortably formulate queries against graph databases. The DSL (or domain-specific language) is based on the Cypher query language, which is developed as part of the Neo4j graph database by Neo Technology. The DSL provides all the power and expressiveness of the Cypher language – hence the name, JCypher. Additionally, JCypher provides database access in a uniform way to remote as well as embedded databases (including in-memory databases).

For more information on JCypher, visit the project homepage and GitHub page.

UPCOMING WEBINAR: Converting Tough SQL Queries into Easy Cypher Queries

Register for this week's webinar on 9 July 2015 at 9:00 a.m. Pacific (18:00 CEST) to learn how to transform non-performing relational queries into efficient Cypher statements or Neo4j extensions and achieve your required response times.

The post JCypher: Focus on Your Domain Model, Not How to Map It to the Database [Community Post] appeared first on Neo4j Graph Database.

Interview: Monitor Network Interdependencies with Neo4j

Read This Interview to Learn How to Monitor Network Interdependencies Using Graph Databases

[This article is excerpted from a white paper by EMA and is used with permission.]

Traditional relational databases served the IT industry well in the past. Yet in most deployments today, they demand significant overhead and expert levels of administration to adapt to change. The fact is, relational databases require cumbersome indexing when faced with the non-hierarchic relationships that are becoming all too common in complex IT ecosystems, as well as in the dynamic infrastructures associated with cloud and agile.

So, how does this affect your ability to monitor network interdependencies (and react accordingly)? If you're still relying on relational databases for data center and network management, your organization will be caught in the past. However, with a graph database, you're more prepared than ever to manage and monitor dependencies in your network, even as requirements and available technology change.

Graph databases like Neo4j make it easier to evolve models of real-world infrastructures, business services, social relationships or business behaviors that are both fluid and multi-dimensional. Your network data is already a graph, and with a graph database, you can more intuitively manage those interconnected relationships. Neo4j is built to support high-performance graph queries on large datasets for large enterprises with high-availability requirements. It includes its own graph query language, uses native graph processing, and has a storage system natively optimized for graphs.

As the second post of a two-part series on Neo4j and network management, we've interviewed a software consultant who is working with a large European telecommunications provider to manage and monitor network interdependencies.

Can you tell me a little bit more about you and your organization?

My firm is a software consultancy and I work closely with many Neo4j deployments with a focus on modeling, problem solving and innovation. I see some distinctive advantages in graph databases, and in particular, in Neo4j’s offering.

Can you share more specifically how you view those advantages?

The graph model is unique in its ability to accommodate highly connected, partially structured datasets that can evolve over time in terms of complexity and structure. Graphs are also naturally capable of supporting a wide range of evolvable, ad-hoc queries on top of these datasets. This not only makes for much-improved flexibility in design; it also enables relationships to be captured easily that are unsuited to traditional hierarchic models, and allows for much better adaptability when the changes themselves are less predictable or not strictly hierarchic in nature. One of the things I especially appreciate is that Neo4j makes it simple to model real-life or business situations – it provides a much better working foundation for key stakeholders who are not necessarily technical.

Can you tell me a little more about the requirements of the deployment you did for a large telecommunications provider?

This company had a very large, complex network with many silos and processes – including network management information spread across more than thirty systems. The large number of data sources was in part due to network complexity, and in part due to different business units as well as organic growth through mergers and acquisitions. These different sources also created a very non-linear fabric that had to be modeled and understood from various dimensions. Prior to Neo4j, they had different network layers stored in different systems – for instance, one system might be dedicated to cell towers, another to fiber cables, and another devoted to information about consumers or enterprise customers.

The company needed a way to predict and warn customers in advance of any service interruptions, in order to maintain customer service agreements and avoid financial penalties due to unplanned downtime. With daily changes required to optimize the network infrastructure, managing this effectively was definitely a challenge. One of their business process challenges was around maintenance and ensuring redundancy – they needed to know, if they took a device down for maintenance, exactly who might be impacted and what the penalties might be, as well as what alternate routes might better mitigate the impact. There was also a more proactive planning requirement – e.g., planning to lay an alternate cable for backup and knowing how things are connected so best-case alternate paths can be identified. What are all the upstream interdependencies? Downstream interdependencies? etc.
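The interview doesn't include any queries, but an impact-analysis question like "who is affected if we take this device down?" maps naturally onto a variable-length traversal in Cypher. A sketch in which every label and relationship type is assumed:

// Everything that depends, directly or transitively, on one device.
MATCH (d:Device {id: 'router-42'})<-[:DEPENDS_ON*1..5]-(impacted:Customer)
RETURN count(DISTINCT impacted) AS affectedCustomers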

How did you get involved?

This company had some choices between Neo4j and some very rigid and expensive tools designed to fit specific needs. For instance, they already had an impact analysis system from which they were extracting spreadsheets, with a team of about ten people doing manual work on the spreadsheets – which is expensive and error-prone. But a small team at that company did a proof of concept with Neo4j and felt that it had many advantages, both in terms of immediate benefits and potential, given the graph nature of many network interdependencies across various processes. Once the POC team showed some initial potential, they got the buy-in to move forward to next steps, and I came on board.

How many people were there on the Proof of Concept Team? And how did the deployment evolve?

There were only three: two developers plus the project manager. It only took a few months to show the benefits. As the deployment evolved, we added someone to support the needed integrations. Within four to six months we were able to match the pre-existing system and to demonstrate benefits and advantages. These included fast and powerful queries, along with a custom visualization module.

Then we proceeded to take the next steps to support more complex root-cause analysis – e.g., "if you do this, or if this occurs, it will cause this specific problem," or conversely, "this is the reason that you experienced this problem." All along the way there was fierce competition to show value, as this telecommunications provider was very serious about managing its costs.

One of the things I like best about Neo4j is that it supports incremental development. You don't have to get all the data at once to get value from it. You can build your graph in an incremental way, as opposed to more rigid approaches, and then add other layers to accommodate more data and more complex or new relationships.

It was almost a dream business case, because you could measure the benefit of the project as the telecommunications provider began to manage production-level changes that impacted its many actual customers. Every time they got something wrong there were immediate costs in penalties. And the values were huge.

What were some of the other benefits that the Neo4j deployment achieved there?

After implementation of the model and the impact analysis queries, it was easy to extend the application to support single-point-of-failure detection, thanks to the flexibility of the graph model. Also, by providing an effectively unified cross-domain view, experts from different silos could work together for the first time and agree on a common domain terminology.

Read the first post of our two-part series on Neo4j and network management here. Dive deeper into how graph databases transform your ability to monitor network interdependencies – click below to download this white paper, How Graph Databases Solve Problems in Network & Data Center Management, and start solving your IT challenges with graph databases. Download My White Paper

The post Interview: Monitor Network Interdependencies with Neo4j appeared first on Neo4j Graph Database.

Graph Databases for Beginners: Why Data Relationships Matter

We live in an ever-more-connected world, and data relationships will only increase in the years to come. If your business is to succeed in a highly connected world, you must learn to leverage those connections for all they're worth – but you'll need the right technology. With so many systems built on relational databases or aggregate NoSQL stores, you may not know of a third option that outperforms them both: graph databases.

In this "Graph Databases for Beginners" blog series, I'll take you through the basics of graph technology, assuming you have little (or no) background in the space. Last week, we tackled why graphs are the future. This week, we'll discuss why data relationships matter when choosing a database.

The Irony of Relational Databases

Relational databases (RDBMS) were originally designed to codify paper forms and tabular structures, and they still do this exceedingly well. Ironically, however, relational databases aren't effective at handling data relationships, especially when those relationships are added or adjusted on an ad hoc basis.

The greatest weakness of relational databases is that their schema is too inflexible. Your business needs are constantly changing and evolving, but the schema of a relational database can't efficiently keep up with those dynamic and uncertain variables. To compensate, your development team can try to leave certain columns empty (tech lingo: nullable), but this approach requires more code to handle the greater number of exceptions in your data. Even worse, as your data multiplies in complexity and diversity, your relational database becomes burdened with large join tables which disrupt performance and hinder further development.

Consider the sample relational database below. In order to discover what products a customer bought, your developers would need to write several joins, which significantly slow the performance of the application. Furthermore, asking a reciprocal question like, "Which customers bought this product?" or "Which customers buying this product also bought that product?" becomes prohibitively expensive. Yet questions like these are essential if you want to build a proper recommendation engine for your transactional application.
Discover Why Data Relationships Matter in this Graph Databases for Beginners Blog Series

An example relational database where some queries are inefficient-yet-doable (e.g., “What items did a customer buy?”) and other queries are prohibitively slow (e.g., “Which customers bought this product?”).

At a certain point, your business needs will entirely outgrow your current database schema. The problem, however, is that migrating your data to a new schema becomes incredibly effort-intensive.
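For contrast, here is how those reciprocal questions read in Cypher once the data lives in a graph – a minimal sketch with illustrative labels, relationship types and product names, none of which come from the article:

// "Which customers bought this product?"
MATCH (c:Customer)-[:BOUGHT]->(:Product {name: 'Widget'})
RETURN c.name

// "Which customers buying this product also bought that product?"
MATCH (:Product {name: 'Widget'})<-[:BOUGHT]-(c:Customer)-[:BOUGHT]->(other:Product)
RETURN other.name, count(*) AS frequency
ORDER BY frequency DESC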

Why NoSQL Databases Don’t Fix the Problem Either

NoSQL (or Not only SQL) databases store sets of disconnected documents, values and columns, which in some ways gives them a performance advantage over relational databases. However, their disconnected construction makes it harder to harness data relationships properly.

Some developers add data relationships to NoSQL databases by embedding aggregate-identifying information inside the field of another aggregate (tech lingo: they use foreign keys). But joining aggregates at the application level later becomes just as prohibitively expensive as in a relational database. These foreign keys have another weak point too: they only "point" in one direction, making reciprocal queries too time-consuming to run. Developers usually work around this problem by inserting backward-pointing relationships or by exporting the dataset to an external compute structure, like Hadoop, and computing the result with brute force. Either way, the results are slow and latent.

Graphs Put Data Relationships at the Center

When you want a cohesive picture of your big data, including the connections between elements, you need a graph database. In contrast to relational and NoSQL databases, graph databases store data relationships as relationships. This explicit storage of relationship data means fewer disconnects between your evolving schema and your actual database.

In fact, the flexibility of a graph model allows you to add new nodes and relationships without compromising your existing network or expensively migrating your data. All of your original data (and its original relationships) remain intact. With data relationships at their center, graph databases are incredibly efficient when it comes to query speeds, even for deep and complex queries.

In their book Neo4j in Action, Partner and Vukotic performed an experiment comparing a relational database and a graph database (Neo4j!). Their experiment used a basic social network to find friends-of-friends connections to a depth of five degrees. Their dataset included 1,000,000 people, each with approximately 50 friends. The results of their experiment are listed in the table below.
A performance experiment run between relational databases (RDBMS) and Neo4j

A performance experiment run between relational databases (RDBMS) and Neo4j shows that graph databases handle data relationships extremely efficiently.
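For a sense of what the experiment's query looks like on the graph side, a depth-five friends-of-friends traversal in Cypher might be written like this (a sketch with assumed label and relationship names, not the code from the book):

MATCH (p:Person {name: 'Alice'})-[:FRIEND*1..5]-(fof:Person)
WHERE fof <> p
RETURN count(DISTINCT fof) AS reachableFriends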

At the friends-of-friends level (depth two), both the relational database and the graph database performed adequately. However, as the depth of connectedness increased, the performance of the graph database quickly outstripped that of the relational database. It turns out data relationships are vitally important.

This comparison isn't to say NoSQL stores or relational databases don't have a role to play (they certainly do), but they fall short when it comes to connected data relationships. Graphs, however, are extremely effective at handling connected data.

Want to dive deeper into the world of graph databases? Learn how to apply graph technologies to mission-critical problems with O'Reilly's Graph Databases. Click below to get your free copy of the definitive book on graph databases and your introduction to Neo4j. Catch up with the rest of the Graph Databases for Beginners series:

The post Graph Databases for Beginners: Why Data Relationships Matter appeared first on Neo4j Graph Database.

Graph Databases for Beginners: Data Modeling Pitfalls to Avoid

With the advent of graph databases, data modeling has become accessible to the masses. Mapping business needs into a well-defined structure for data storage and organization has made a sortie du temple (of sorts) from the realm of the well-educated few to the province of the proletariat. No longer the sole domain of senior DBAs and principal developers, anyone with a basic understanding of graphs can complete a rudimentary data model – from the CEO to the intern. (This doesn't mean we don't still need expert data modelers. If you're a data modeling vet, here's your more advanced introduction to graph data modeling.)

Yet with greater ease and accessibility comes an equal likelihood that data modeling might go wrong. And if your data model is weak, your entire application will be too.

In this "Graph Databases for Beginners" blog series, I'll take you through the basics of graph technology, assuming you have little (or no) background in the space. In past weeks, we've tackled why graphs are the future, why data relationships matter and how graph databases make data modeling easier than ever, especially for the uninitiated. This week, we'll discuss how to avoid the most common (and fatal) mistakes when completing your data model.

Example Data Model: Fraud Detection in Email Communications

Graph databases are highly expressive when it comes to data modeling for complex problems. But expressivity isn't a guarantee that you'll get your data model right on the first try. Even graph database experts make mistakes, and beginners are bound to make even more. Let's dive into an example data model to witness the most common mistakes (and their consequences) so you don't have to learn from the same errors in your own data model.

In this example, we'll examine a fraud detection application that analyzes users' email communications. This particular application is looking for rogue behavior and suspicious emailing patterns that might indicate illegal or unethical behavior. We're particularly looking for patterns from past wrongdoers, such as frequent use of blind-copying (BCC) and of aliases to conduct fake "conversations" that mimic legitimate interactions.

In order to catch this sort of unscrupulous behavior, we'll need a graph data model that captures all the relevant elements and activities. For our first attempt at the data model, we'll map some users, their activities and their known aliases, including a relationship describing Alice as one of Bob's known aliases. The result is a star-shaped graph with Bob in the center.
Learn to Avoid These Common Data Modeling Pitfalls in This Graph Databases for Beginners Blog Series

Our first data model attempting to map Bob’s suspicious email activity with Alice as a known alias. However, this data model isn’t robust enough to detect wrongful behavior.

At first blush, this initial data modeling attempt looks like an accurate representation of Bob's email activity; after all, we can easily see that Bob (an alias of Alice) emailed Charlie while BCC'ing Edward and CC'ing Davina. But we can't see the most important part of all: the email itself.

A beginning data modeler might try to remedy the situation by adding properties to the EMAILED relationship, representing the email's attributes as properties. However, that's not a long-term solution. Even with properties attached to each EMAILED relationship, we wouldn't be able to correlate connections between EMAILED, CC and BCC relationships – and those correlating relationships are exactly what we need for our fraud detection solution.

This is the perfect example of a common data modeling mistake. In everyday English, it's easy and convenient to shorten the phrase "Bob sent an email to Charlie" to "Bob emailed Charlie." This shortcut made us focus on the verb "emailed" rather than on the email as an object in itself. As a result, our incomplete model keeps us from the insights we're looking for.

The Fix: A Stronger Fraud Detection Data Model

To fix our weak model, we need to add nodes to our graph model that represent each of the emails exchanged. Then, we need to add new relationships to track who wrote the email and to whom it was sent, CC’ed and BCC’ed. The result is another star-shaped graph, but this time the email is at the center, allowing us to efficiently track its relationship to Bob and possibly some suspicious behavior.
An Email Fraud Detection Graph with the Email Itself at the Center

Our second attempt at a fraud detection data model. This iteration allows us to more easily trace the relationships of who is sending and receiving each email message.
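A minimal Cypher sketch of this email-centric model – the SENT relationship and all property names are assumptions for illustration, not the book's exact code:

CREATE (bob:User {username: 'Bob'}),
       (charlie:User {username: 'Charlie'}),
       (davina:User {username: 'Davina'}),
       (edward:User {username: 'Edward'}),
       (email:Email {id: 1, content: 'email contents'}),
       (bob)-[:SENT]->(email),
       (email)-[:TO]->(charlie),
       (email)-[:CC]->(davina),
       (email)-[:BCC]->(edward)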

Of course we aren't interested in tracking just one email but many, each with its own web of interactions to explore. Over time, our email server logs more interactions, giving us something like the graph below.
An Email Fraud Detection Graph with BCC and CC Metadata

A data model showing many emails over time and their various relationships, including the sender and the direct, CC and BCC receivers.

The Next Step: Tracking Email Replies

At this point, our data model is more robust, but it isn’t complete. We can see who sent and received emails, and we can see the content of the emails themselves. Nevertheless, we can’t track any replies or forwards of our given email communications. In the case of fraud or cybersecurity, we need to know if critical business information has been leaked or compromised. To complete this upgrade, beginners might be tempted to simply add FORWARDED and REPLIED_TO relationships to our graph model, like in the example below.
A Fraud Detection Email Graph Attempting to Account for Replies and Forwards

Our updated data model with FORWARDED and REPLIED_TO relationships in addition to the original TO relationship.

This approach, however, quickly proves inadequate. Much in the same way the EMAILED relationship didn't give us the proper information, simply adding FORWARDED or REPLIED_TO relationships doesn't give us the insights we're really looking for.

To build a better data model, we need to consider the fundamentals of this particular domain. A reply to an email is both a new email and a reply to the original. The two roles of a reply can be represented by attaching two labels – "Email" and "Reply" – to the appropriate node. We can then use the same TO, CC and BCC relationships to map whether the reply was sent to the original sender, all recipients or a subset of recipients. We can also reference the original email with a REPLY_TO relationship. The resulting graph data model is shown below.
An Email Fraud Detection Graph Showing Replies and Forwards
Not only can we see who replied to Bob’s original email, but we can track replies-to-replies and replies-to-replies-to-replies, and so on to an arbitrary depth. If we’re trying to track a suspicious number of replies to known aliases, the above graph data model makes this extremely simple.
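With that model in place, tracing reply chains to any depth becomes a short query. A sketch, reusing the assumed identifiers from above:

// Who replied – directly or through a chain of replies – to email 1?
MATCH (:Email {id: 1})<-[:REPLY_TO*1..]-(reply:Reply),
      (replier:User)-[:SENT]->(reply)
RETURN replier.username AS replier, count(reply) AS replies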

Homework: Data Modeling for Email Forwards

Equally important to tracking email replies is tracking email forwards, especially when it comes to leaked business information. As a data modeling acolyte, your homework assignment is to document how you would model the forwarded email data, tracking the relationships with senders, direct recipients, CC'ed recipients, BCC'ed recipients and the original email. Check your work on pages 61 and 62 of the O'Reilly Graph Databases book, available here.

Data modeling has been made much easier with the advent of graph databases. However, while it's simpler than ever to translate your whiteboard model into a physical one, you need to ensure your data model is designed effectively for your particular use case. There are no absolute rights or wrongs in graph data modeling, but you should avoid the pitfalls mentioned above in order to glean the most valuable insights from your data.

Ready to sharpen your understanding of graph databases? Click below to get your free copy of the O'Reilly Graph Databases ebook and discover how to apply graph technologies to mission-critical problems at your enterprise. Catch up with the rest of the Graph Databases for Beginners series:

The post Graph Databases for Beginners: Data Modeling Pitfalls to Avoid appeared first on Neo4j Graph Database.

5 Secrets to More Effective Neo4j 2.2 Query Tuning

Learn Five Secrets to More Effective Query Tuning with Neo4j 2.2.x

Even in Neo4j, with its high-performance graph traversals, there are always queries that could and should run faster – especially if your data is highly connected and global pattern matches make even a single query account for many millions or billions of paths.

For this article, we’re using the larger movie dataset, which is also listed on the example datasets page.

The domain model that interests us here is pretty straightforward:

(:Person {name}) -[:ACTS_IN|:DIRECTED]-> (:Movie {title})
(:Movie {title}) -[:GENRE]-> (:Genre {name})


HARDWARE


I presume you use a sensible machine, with an SSD (or enough IOPS) and a decent amount of RAM. For a highly concurrent load, there should also be enough CPU cores to handle it.

Other questions to consider: Have you monitored I/O waits (e.g., with top) as well as CPU and memory usage? Did any bottlenecks turn up?

If so, you should address those issues first.

On Linux, configure your disk scheduler to noop or deadline and mount the database volume with noatime. See this blog post for more information.

CONFIG


For best results, use the latest stable version of Neo4j (i.e., Neo4j Enterprise 2.2.5). There is always an Enterprise trial version available to give you a high-watermark baseline, so compare it to Neo4j Community on your machine as needed.

Set dbms.pagecache.memory in conf/neo4j.properties to 4G, or to the combined size of the store files (nodes, relationships, properties, string properties):

ls -lt data/graph.db/neostore.*.db
3802094 16 Jul 14:31 data/graph.db/neostore.propertystore.db
 456960 16 Jul 14:31 data/graph.db/neostore.relationshipstore.db
 442260 16 Jul 14:31 data/graph.db/neostore.nodestore.db
   8192 16 Jul 14:31 data/graph.db/neostore.schemastore.db
   8190 16 Jul 14:31 data/graph.db/neostore.labeltokenstore.db
   8190 16 Jul 14:31 data/graph.db/neostore.relationshiptypestore.db
   8175 16 Jul 14:31 data/graph.db/neostore.relationshipgroupstore.db

Set the heap to between 8 and 16G, depending on the RAM size of the machine, and also configure the young generation in conf/neo4j-wrapper.conf:

wrapper.java.initmemory=8000
wrapper.java.maxmemory=8000
wrapper.java.additional=-Xmn2G


That’s mostly it, config-wise. If you are concurrency heavy, you could also set the webserver threads in conf/neo4j-server.properties.

# cpu * 2
org.neo4j.server.webserver.maxthreads=24

QUERY TUNING


If these previous factors are taken care of, it’s now time to dig into query tuning. A lot of query tuning is simply prefixing your statements with EXPLAIN to see what Cypher would do and using PROFILE to retrieve the real execution data as well:

For example, let’s look at this query, which has the PROFILE prefix:

PROFILE
MATCH(g:Genre {name:"Action"})<-[:GENRE]-(m:Movie)<-[:ACTS_IN]-(a)
WHERE a.name =~ "A.*"
RETURN distinct a.name;


The result of this query is shown below in the visual query plan tool available in the Neo4j browser.

A Screenshot of a Query Plan for More Effective Query Tuning


While the visual query plan is nice in the Neo4j browser, the one in Neo4j-shell is easier to compare and it also has more raw numbers.

Operator            | Est.Rows |  Rows | DbHits | Identifiers                 | Other
--------------------+----------+-------+--------+-----------------------------+---------------------------
Distinct            |     2048 |   860 |   2636 | a.name                      | a.name
Filter(0)           |     2155 |  1318 |  41532 | anon[32], anon[52], a, g, m | a.name ~= /{ AUTOSTRING1}/
Expand(All)(0)      |     2874 | 20766 |  23224 | anon[32], anon[52], a, g, m | (m)<-[:ACTS_IN]-(a)
Filter(1)           |      390 |  2458 |   2458 | anon[32], g, m              | m:Movie
Expand(All)(1)      |      390 |  2458 |   2459 | anon[32], g, m              | (g)<-[:GENRE]-(m)
NodeUniqueIndexSeek |        1 |     1 |      1 | g                           | :Genre(name)

Total database accesses: 72310



Query Tuning Tip #1: Use Indexes and Constraints for Nodes You Look Up by Properties

Check – with either schema or :schema – that there is an index in place for non-unique values and a constraint for unique values, and make sure – with EXPLAIN – that the index is used in your query.

CREATE INDEX ON :Movie(title);
CREATE INDEX ON :Person(name);
CREATE CONSTRAINT ON (g:Genre) ASSERT g.name IS UNIQUE;


Even for range queries (pre-Neo4j 2.3), it might be better to turn them into an IN query to leverage an index.

// if :Movie(released) is indexed, this query for the nineties will *not use* an index:
MATCH (m:Movie) WHERE m.released >= 1990 and m.released < 2000
RETURN count(*);

CREATE INDEX ON :Movie(released);

// but this will
MATCH (m:Movie) WHERE m.released IN range(1990,1999)  RETURN count(*);

// same for OR queries
MATCH (m:Movie) WHERE m.released = 1990 OR m.released = 1991 OR ...


Query Tuning Tip #2: Patterns with Bound Nodes are Optimized

If you have a pattern (node)-[:REL]->(node) where the nodes on both sides are already bound, Cypher will optimize the match by taking the node degree (number of relationships) into account when checking for the connection, starting on the smaller side and also caching internally.

So, for example, (actor)-[:ACTS_IN]->(movie) – if both actor and movie are known – turns into an Expand(Into) operation.

If one side is not known, then it is a normal Expand(All) operation.
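For illustration, here is a hedged sketch – the movie title is chosen only as an example – where binding both endpoints through their indexed properties first should turn the final MATCH into a connection check (Expand(Into)) rather than a full expansion:

PROFILE
MATCH (a:Person {name:"Tom Hanks"}), (m:Movie {title:"Cast Away"})
MATCH (a)-[:ACTS_IN]->(m)
RETURN count(*);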

Query Tuning Tip #3: Enforce Index Lookups for Both Sides of a Path

If the nodes on both sides of a longer path can be found in an index – and they account for only a few hits out of a much larger total – make sure to add USING INDEX for both sides. In many cases, that makes a big difference. It doesn't help, however, if the path explodes in the middle and a simple left-to-right traversal with property checks would touch fewer paths.

PROFILE
MATCH (a:Person {name:"Tom Hanks"})-[:ACTS_IN]->()<-[:ACTS_IN]-(b:Person {name:"Meg Ryan"})
RETURN count(*);

Operator         | Est.Rows | Rows | DbHits | Identifiers                        | Other
-----------------+----------+------+--------+------------------------------------+---------------------------------------------------------------------
EagerAggregation |        0 |    1 |      0 | count(*)                           |
Filter           |        0 |    3 |    531 | anon[36], anon[49], anon[51], a, b | a:Person AND a.name == { AUTOSTRING0} AND NOT(anon[36] == anon[51])
Expand(All)(0)   |        3 |  177 |    204 | anon[36], anon[49], anon[51], a, b | ()<-[:ACTS_IN]-(a)
Expand(All)(1)   |        2 |   27 |     28 | anon[49], anon[51], b              | (b)-[:ACTS_IN]->()
NodeIndexSeek    |        1 |    1 |      2 | b                                  | :Person(name)

Total database accesses: 765


If we add the second index-hint, we get 10x fewer database hits.

PROFILE
MATCH (a:Person {name:"Tom Hanks"})-[:ACTS_IN]->()<-[:ACTS_IN]-(b:Person {name:"Meg Ryan"})
USING INDEX a:Person(name) USING INDEX b:Person(name)
RETURN count(*);

Operator         | Est.Rows | Rows | DbHits | Identifiers                        | Other
-----------------+----------+------+--------+------------------------------------+---------------------------
EagerAggregation |        0 |    1 |      0 | count(*)                           |
Filter           |        0 |    3 |      0 | anon[36], anon[49], anon[51], a, b | NOT(anon[36] == anon[51])
NodeHashJoin     |        0 |    3 |      0 | anon[36], anon[49], anon[51], a, b | anon[49]
Expand(All)(0)   |        2 |   27 |     28 | anon[49], anon[51], b              | (b)-[:ACTS_IN]->()
NodeIndexSeek(0) |        1 |    1 |      2 | b                                  | :Person(name)
Expand(All)(1)   |        2 |   35 |     36 | anon[36], anon[49], a              | (a)-[:ACTS_IN]->()
NodeIndexSeek(1) |        1 |    1 |      2 | a                                  | :Person(name)

Total database accesses: 68


Query Tuning Tip #4: Defer Property Access

Make sure to access properties only as the last operation – if possible – and on the smallest set of nodes and relationships. Massive property loading is more expensive than following relationships.

For example, this query:

PROFILE
MATCH (p:Person)-[:ACTS_IN]->(m:Movie)
RETURN p.name, count(*) as c
ORDER BY c DESC limit 10;

Operator         | Est.Rows |  Rows | DbHits | Identifiers                   | Other
-----------------+----------+-------+--------+-------------------------------+---------------------
Projection(0)    |      308 |    10 |      0 | anon[48], anon[54], c, p.name | anon[48]; anon[54]
Top              |      308 |    10 |      0 | anon[48], anon[54]            | { AUTOINT0}
EagerAggregation |      308 | 44689 |      0 | anon[48], anon[54]            | anon[48]
Projection(1)    |    94700 | 94700 | 189400 | anon[48], anon[17], m, p      | p.name
Filter           |    94700 | 94700 |  94700 | anon[17], m, p                | p:Person
Expand(All)      |    94700 | 94700 | 107562 | anon[17], m, p                | (m)<-[:ACTS_IN]-(p)
NodeByLabelScan  |    12862 | 12862 |  12863 | m                             | :Movie

Total database accesses: 404525


The query shown above accesses p.name for all people, contributing to a total of more than 400,000 database hits. Instead, you should aggregate on the node first, then order and paginate, and only at the very end access and return the property.

PROFILE
MATCH (p:Person)-[:ACTS_IN]->(m:Movie)
WITH p, count(*) as c
ORDER BY c DESC LIMIT 10
RETURN p.name, c;

This second query above only accesses p.name for the top ten actors, and before that, it groups them directly by the nodes, saving us about 200,000 database hits.

Operator         | Est.Rows |  Rows | DbHits | Identifiers    | Other
-----------------+----------+-------+--------+----------------+---------------------
Projection       |      308 |    10 |     20 | c, p, p.name   | p.name; c
Top              |      308 |    10 |      0 | c, p           | { AUTOINT0}; c
EagerAggregation |      308 | 44943 |      0 | c, p           | p
Filter           |    94700 | 94700 |  94700 | anon[17], m, p | p:Person
Expand(All)      |    94700 | 94700 | 107562 | anon[17], m, p | (m)<-[:ACTS_IN]-(p)
NodeByLabelScan  |    12862 | 12862 |  12863 | m              | :Movie

Total database accesses: 215145

But that query can be optimized even further with...

Query Tuning Tip #5: Fast Relationship Counting

There is an optimal implementation for single path-expressions that directly reads the degree of a node. Personally, I always prefer this method over OPTIONAL MATCH, EXISTS or general WHERE conditions: size((s)-[:REL]->()) uses get-degree, which is a constant-time operation (the same holds without a relationship type or direction).

PROFILE
MATCH (n:Person) WHERE EXISTS((n)-[:DIRECTED]->())
RETURN count(*);

Here the plan doesn’t count the nested db-hits in the expression, which it should. That’s why I included the runtime:

1 row 197 ms

Operator         | Est.Rows |  Rows | DbHits | Identifiers | Other
-----------------+----------+-------+--------+-------------+------------------------------------------
EagerAggregation |      194 |     1 |  56216 | count(*)    |
Filter           |    37634 |  6037 |      0 | n           | NestedPipeExpression(ExpandAllPipe(...))
NodeByLabelScan  |    50179 | 50179 |  50180 | n           |

Total database accesses: 106396


versus

PROFILE
MATCH (n:Person) WHERE size((n)-[:DIRECTED]->()) <> 0
RETURN count(*);

1 row 90 ms

Operator         | Est.Rows |  Rows | DbHits | Identifiers | Other
-----------------+----------+-------+--------+-------------+----------------------------------------------------------
EagerAggregation |      213 |     1 |      0 | count(*)    |
Filter           |    45161 |  6037 | 100358 | n           | NOT(GetDegree(n,Some(DIRECTED),OUTGOING) == { AUTOINT0})
NodeByLabelScan  |    50179 | 50179 |  50180 | n           | :Person

Total database accesses: 150538


You can also use that technique nicely for overview pages or inline aggregations:

PROFILE
MATCH (m:Movie)
RETURN m.title, size((m)<-[:ACTS_IN]-()) as actors, size((m)<-[:DIRECTED]-()) as directors
LIMIT 10;

+-------------------------------------------------------------+
| m.title                                | actors | directors |
+-------------------------------------------------------------+
| "Indiana Jones and the Temple of Doom" | 13     | 1         |
| "King Kong"                            | 1      | 1         |
| "Stolen Kisses"                        | 21     | 1         |
| "One Flew Over The Cuckoo's Nest"      | 24     | 1         |
| "Ziemia obiecana"                      | 17     | 1         |
| "Scoop"                                | 21     | 1         |
| "Fire"                                 | 0      | 1         |
| "Dial M For Murder"                    | 5      | 1         |
| "Ed Wood"                              | 21     | 1         |
| "Requiem"                              | 11     | 1         |
+-------------------------------------------------------------+
10 rows
13 ms

Operator        | Est.Rows | Rows | DbHits | Identifiers                   | Other
----------------+----------+------+--------+-------------------------------+------------------------------------------------------------------------------------
Projection      |    12862 |   10 |     60 | actors, directors, m, m.title | m.title; GetDegree(m,Some(ACTS_IN),INCOMING); GetDegree(m,Some(DIRECTED),INCOMING)
Limit           |    12862 |   10 |      0 | m                             | { AUTOINT0}
NodeByLabelScan |    12862 |   10 |     11 | m                             | :Movie

Total database accesses: 71


Our query from the previous section would look like this:

PROFILE
MATCH (p:Person)
WITH p, sum(size((p)-[:ACTS_IN]->())) as c
ORDER BY c DESC LIMIT 10
RETURN p.name, c;

This query shaves off another 50,000 database hits. Not bad.

Operator         | Est.Rows |  Rows | DbHits | Identifiers  | Other
-----------------+----------+-------+--------+--------------+----------------
Projection       |      224 |    10 |     20 | c, p, p.name | p.name; c
Top              |      224 |    10 |      0 | c, p         | { AUTOINT0}; c
EagerAggregation |      224 | 50179 | 100358 | c, p         | p
NodeByLabelScan  |    50179 | 50179 |  50180 | p            | :Person

Total database accesses: 150558


Note to self: Optimized Cypher looks more like Lisp.

Bonus Query Tuning Tip: Reduce Cardinality of Work in Progress

When following longer paths, you’ll encounter duplicates. If you’re not interested in all the possible paths – but just distinct information from stages of the path – make sure that you eagerly eliminate duplicates, so that later matches don’t have to be executed multiple times.

This reduction of the cardinality can be done using either WITH DISTINCT or WITH aggregation (which automatically de-duplicates).

So, for instance, for this query of "Movies that Tom Hanks' colleagues acted in":

PROFILE
MATCH (p:Person {name:"Tom Hanks"})-[:ACTS_IN]->(m1)<-[:ACTS_IN]-(coActor)-[:ACTS_IN]->(m2)
RETURN distinct m2.title;

This query has 10,272 db-hits and touches 3,020 total paths.

Operator       | Est.Rows | Rows | DbHits | Identifiers                                      | Other
---------------+----------+------+--------+--------------------------------------------------+---------------------------------------------------------
Distinct       |        4 | 2021 |   6040 | m2.title                                         | m2.title
Filter(0)      |        4 | 3020 |      0 | anon[36], anon[53], anon[75], coActor, m1, m2, p | NOT(anon[53] == anon[75]) AND NOT(anon[36] == anon[75])
Expand(All)(0) |        4 | 3388 |   3756 | anon[36], anon[53], anon[75], coActor, m1, m2, p | (coActor)-[:ACTS_IN]->(m2)
Filter(1)      |        3 |  368 |      0 | anon[36], anon[53], coActor, m1, p               | NOT(anon[36] == anon[53])
Expand(All)(1) |        3 |  403 |    438 | anon[36], anon[53], coActor, m1, p               | (m1)<-[:ACTS_IN]-(coActor)
Expand(All)(2) |        2 |   35 |     36 | anon[36], m1, p                                  | (p)-[:ACTS_IN]->(m1)
NodeIndexSeek  |        1 |    1 |      2 | p                                                | :Person(name)

Total database accesses: 10272


The first-degree neighborhood is unique, since in this dataset there is at most one :ACTS_IN relationship between an actor and a movie. So the first duplicate nodes appear at the second degree, which we can eliminate like this:

PROFILE
MATCH (p:Person {name:"Tom Hanks"})-[:ACTS_IN]->(m1)<-[:ACTS_IN]-(coActor)
WITH distinct coActor
MATCH (coActor)-[:ACTS_IN]->(m2)
RETURN distinct m2.title;

This query tuning technique reduces the number of paths to match for the last step to 2,906. In other use cases with more duplicates, the impact is much bigger.

Operator       | Est.Rows | Rows | DbHits | Identifiers                        | Other
---------------+----------+------+--------+------------------------------------+-----------------------------
Distinct(0)    |        4 | 2031 |   5812 | m2.title                           | m2.title
Expand(All)(0) |        4 | 2906 |   3241 | anon[113], coActor, m2             | (coActor)-[:ACTS_IN]->(m2)
Distinct(1)    |        3 |  335 |      0 | coActor                            | coActor
Filter         |        3 |  368 |      0 | anon[36], anon[53], coActor, m1, p | NOT(anon[36] == anon[53])
Expand(All)(1) |        3 |  403 |    438 | anon[36], anon[53], coActor, m1, p | (m1)<-[:ACTS_IN]-(coActor)
Expand(All)(2) |        2 |   35 |     36 | anon[36], m1, p                    | (p)-[:ACTS_IN]->(m1)
NodeIndexSeek  |        1 |    1 |      2 | p                                  | :Person(name)

Total database accesses: 9529


Of course, we would apply our Defer Property Access tip here, too:

PROFILE
MATCH (p:Person {name:"Tom Hanks"})-[:ACTS_IN]->(m1)<-[:ACTS_IN]-(coActor)
WITH distinct coActor
MATCH (coActor)-[:ACTS_IN]->(m2)
WITH distinct m2
RETURN m2.title;

Operator       | Est.Rows | Rows | DbHits | Identifiers                        | Other
---------------+----------+------+--------+------------------------------------+-----------------------------
Projection     |        4 | 2037 |   4074 | m2, m2.title                       | m2.title
Distinct(0)    |        4 | 2037 |      0 | m2                                 | m2
Expand(All)(0) |        4 | 2906 |   3241 | anon[113], coActor, m2             | (coActor)-[:ACTS_IN]->(m2)
Distinct(1)    |        3 |  335 |      0 | coActor                            | coActor
Filter         |        3 |  368 |      0 | anon[36], anon[53], coActor, m1, p | NOT(anon[36] == anon[53])
Expand(All)(1) |        3 |  403 |    438 | anon[36], anon[53], coActor, m1, p | (m1)<-[:ACTS_IN]-(coActor)
Expand(All)(2) |        2 |   35 |     36 | anon[36], m1, p                    | (p)-[:ACTS_IN]->(m1)
NodeIndexSeek  |        1 |    1 |      2 | p                                  | :Person(name)

Total database accesses: 7791


We still need the distinct m2 at the end, as the co-actors can have played in the same movies, and we don’t want duplicate results.

This query has 7,791 db-hits and touches 2,906 paths in total.

If you are also interested in the frequency (e.g., for scoring), you can compute it with an aggregation instead of DISTINCT. In the end, you just multiply the path count per co-actor by the number of occurrences per movie.

MATCH (p:Person {name:"Tom Hanks"})-[:ACTS_IN]->(m1)<-[:ACTS_IN]-(coActor)
WITH coActor, count(*) as freq
MATCH (coActor)-[:ACTS_IN]->(m2)
RETURN m2.title, freq * count(*) as occurrence;

Conclusion


The best way to start with query tuning is to take the slowest queries, PROFILE them and optimize them using these tips.

If you need help, you can always reach out to us on Stack Overflow, our Google Group or our public Slack channel.

If you are part of a project that is adopting Neo4j or putting it into production, make sure to get some expert help to ensure you’re successful. Note: If you do ask for help, please provide enough information for others to be able to help you. Explain your graph model, share your queries, their profile output and – best of all – a dataset to run them on.


Need more tips on how to effectively use Neo4j? Register for our online training class, Neo4j in Production, and learn how to master the world’s leading graph database.

The post 5 Secrets to More Effective Neo4j 2.2 Query Tuning appeared first on Neo4j Graph Database.

Graph Databases in the Enterprise: Graph-Based Search

Learn More about the Graph-Based Search Use Case of Graph Databases in the Enterprise

Graph-based search is a new approach to data and digital asset management originally pioneered by Facebook and Google.

Search powered by a graph database delivers relevant information that you may not have specifically asked for – offering a more proactive and targeted search experience, allowing you to quickly triangulate the data points of the greatest interest.

The key to this enhanced search capability is that on the very first query, a graph-based search engine takes into account the entire structure of available connected data. And because graph systems understand how data is related, they return much richer and more precise results.

Think of graph-based search more as a “conversation” with your data, rather than a series of one-off searches. It’s search and discovery, rather than search and retrieval.

In this “Graph Databases in the Enterprise” series, we’ll explore the most impactful and profitable use cases of graph database technologies at the world’s leading organizations. In past weeks, we’ve examined fraud detection, real-time recommendation engines, master data management, network & IT operations and identity & access management (IAM).

This week, we’ll take a closer look at graph-based search.

The Key Challenges in Graph-Based Search:


As a cutting-edge technology, graph-based search is beset with challenges. Here are some of the biggest:

    • The size and connectedness of asset metadata: The usefulness of a digital asset increases with the associated rich metadata describing the asset and its connections. However, adding more metadata increases the complexity of managing and searching for an asset.
    • Real-time query performance: The power of a graph-based search application lies in its ability to search and retrieve data in real time. Yet traversing such complex and highly interconnected data in real time is a significant challenge.
    • A growing number of data nodes: With the rapid growth in the size of assets and their associated metadata, your application needs to be able to accommodate both current and future requirements.

Why Use a Graph Database for Graph-Based Search?


Graph-based search would be impossible without a graph database to power it.

In essence, graph-based search is intelligent: You can ask much more precise and useful questions and get back the most relevant and meaningful information, whereas traditional keyword-based search delivers results that are more random, diluted and low-quality.

With graph-based search, you can easily query all of your connected data in real time, then focus on the answers provided and launch new real-time searches prompted by the insights you’ve discovered.

Graph databases make advanced search-and-discovery possible because:
    • Enterprises can structure their data exactly as it occurs and carry out searches based on their own inherent structure. Graph databases provide the model and query language to support the natural structure of data.
    • Users receive fast, accurate search results in real time. With a graph database, a variety of rich metadata is assigned to all content for rapid search and discovery.
    • Data architects and developers can easily change their data and its structure as well as add a wide variety of new data. The built-in flexibility of a graph database model allows for agile changes to search capabilities.
In contrast, information held in a relational database is much more inflexible to future change: If you want to add new kinds of content or make structural changes, you are forced to re-work the relational model in a way that you don’t need to do with the graph model.

The graph model is much more easily extensible and over 1,000 times faster than a relational database when working with connected data.

Example: Google and Facebook


In their early days, both Facebook and Google offered a basic “keyword” search, where users would type in a word or phrase and get back a list of all results that included those keywords.

This method relied on plain pattern recognition, and many users found it to be a cumbersome process of repeatedly redefining search terms until the correct result was found.

Facebook’s database of people and Google’s database of information have one crucial thing in common: They were both built using graph technology. And in recent years, both Google and Facebook have realized they could make much better use of their huge swathes of searchable content, and have each launched new graph-based search services to exploit these commercial opportunities.

Realizing the limitations of keyword searches, Google launched its “Knowledge Graph” in 2012 and Facebook followed suit with its “Graph Search” service in 2013, both of which provide users with more contextual information in their searches.

As a result of these new services, both enterprises realized substantial lift in user engagement – and therefore commercial success.

Following in the footsteps of giants like Facebook, Google and adidas, new startups like Glowbl and Decibel – and many others – have also created graph-based search tools to discover new business insights, launch new products and services and attract new customers.

Conclusion


For businesses that have huge volumes of products, content or digital assets, graph-based search provides a better way to make this data available to users, as corporate giants Google and Facebook have clearly demonstrated.

The valuable uses of graph-based search in the enterprise are endless: customer support portals, product catalogs, content portals and social networks are just a few.

Graph-based search offers numerous competitive advantages, including better customer experience, more targeted content and increased revenue opportunities.

Enterprises that tap into the power of graph-based search today will be well ahead of their peers tomorrow.


Download your copy of this white paper, The Top 5 Use Cases of Graph Databases, and discover how to tap into the power of connected data at your enterprise.



Catch up with the rest of the “Graph Databases in the Enterprise” series:

The post Graph Databases in the Enterprise: Graph-Based Search appeared first on Neo4j Graph Database.

How Backstory.io Uses Neo4j to Graph the News [Community Post]


[As community content, this post reflects the views and opinions of the particular author and does not necessarily reflect the official stance of Neo4j.]

Backstory is a news exploration website I co-created with my friend Devin.

The site automatically organizes news from hundreds of sources into rich, interconnected timelines. Our goal is to empower people to consume news in a more informative and open-ended way.

The News Graph


Our ability to present and analyze news in interesting ways is based on an extensive and ever-growing “news graph” powered by Neo4j.

The core graph model is shown in simplified form below:

Learn How Backstory.io Uses Neo4j to Graph News Stories in a New Way


Consider three articles published by different news sources on November 16th, 2015.

First, Backstory collects these articles and stores them as ARTICLE nodes in the graph.

Second, article text is analyzed for named entities, stored as ACTOR nodes. Articles have a REFERENCED relationship with their actors.

Third, these articles are clustered because they’re about the same thing: U.S. Secretary of State John Kerry visiting France after the terrorist attacks in Paris. The article cluster is represented by an EVENT node. All articles and actors in a cluster point to their news event with an IN_EVENT relationship.

Finally, all actors in the cluster point to one another using a dated WITH relationship, to record their co-occurrence.
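Putting those four steps together, the Kerry example might be wired into the graph roughly like this – a sketch only, with illustrative property names and values rather than Backstory's actual schema:

// One clustered article with two of its actors and their event.
CREATE (article:ARTICLE {url: "http://example.com/kerry-visits-paris"})
CREATE (kerry:ACTOR {name: "John Kerry"}), (paris:ACTOR {name: "Paris"})
CREATE (event:EVENT {date: 1447632000000})
CREATE (article)-[:REFERENCED]->(kerry),
       (article)-[:REFERENCED]->(paris),
       (article)-[:IN_EVENT]->(event),
       (kerry)-[:IN_EVENT]->(event),
       (paris)-[:IN_EVENT]->(event),
       (kerry)-[:WITH {date: 1447632000000}]->(paris)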

Given enough data, this model allows us to answer interesting questions about the news with simple Cypher queries. For example:

What are the most recent news events involving John Kerry?

MATCH (:ACTOR {name: "John Kerry"})-[:IN_EVENT]-(e:EVENT) RETURN e ORDER BY e.date DESC LIMIT 10

When was the last time Islamism interacted with Paris?

MATCH (:ACTOR {name: "Islamism"})-[w:WITH]-(:ACTOR {name: "Paris"}) RETURN w.date ORDER BY w.date DESC LIMIT 1

How many news events involving France occurred this week?

MATCH (:ACTOR {name: "France"})-[:IN_EVENT]-(e:EVENT) WHERE e.date > 1447215879786 RETURN count(e) AS event_count

In addition to the information present in the news graph itself, we tap into a large amount of enriched data by virtue of correlating all actor nodes to Wikipedia entries.

For example, by including a field about the type of thing an actor is, a query can now differentiate a person from a place. Cypher has risen to the challenge and continues to allow for concise queries over an increasingly complex graph.
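Assuming that field is stored as a type property on ACTOR nodes – our guess at the naming – such a query might look like this:

// Restrict the search to actors classified as people.
MATCH (a:ACTOR {type: "person"})-[:IN_EVENT]-(e:EVENT)
RETURN a.name, count(e) AS event_count
ORDER BY event_count DESC LIMIT 10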

Neo4j For The Win


We are big Neo4j fans at Backstory. The graph technology and community has propelled us forward in many ways.

Here are just a few examples:

There Are Ample Neo4j Clients across Languages

In the Backstory system architecture – described in more detail here – there are a variety of components that read from and write to the graph database.

A combination of requirements and personal taste have led us to write these components in different languages, and we are pleased with the variety of options available for talking to Neo4j.

On the write side, we use the Neo4j Java REST Bindings. This component also uses a custom testing framework that allows us to run suites of integration tests against isolated, transient embedded Neo4j instances.

On the read side, we’ve created an HTTP API that codifies the queries the Backstory.io website makes. This is written in Python and uses py2neo.

There’s also an ExpressJS API for administrative purposes, which constructs custom Cypher queries and manages its own transactions with Neo4j.

The Neo4j Browser Is a Crucial Experimentation Tool

The Neo4j Browser is an excellent tool for anything from experimenting with new Cypher queries to running some sanity checks on your production data.

Every Cypher-based feature I’ve developed for Backstory was conceived and hardened in the Browser. I even used it to develop the example queries above!

Graph Flexibility Is Underrated

Early on in our design process for Backstory we were a bit skeptical of using a graph database. Was it really worth leaving the comfort zone of relational databases or key-value stores?

Even after we had committed to a Neo4j prototype, we expected to end up requiring secondary relational storage for any number of requirements outside of the core news graph.

It turns out Neo4j has sufficed for all of our persistent data requirements, and has even led us to novel solutions in several cases. Four quick examples:

    1. Ability to latently add indexes: The Backstory model has evolved substantially over time. New node and relationship types come and go, and properties are added that need to be queried. Neo4j’s support for adding indexes to an existing graph has allowed us to keep queries performant as things change.
    2. Using Neo4j as an article queue: When Backstory collects news articles from the Internet, it has to queue them for textual analysis and event clustering. Instead of using a traditional persistent queue, we realized that Neo4j would support this requirement with minimal additional effort on our part. We already had Article nodes, so it was a matter of adding an “Unprocessed” label to new ones and processing them in insertion order (see the sketch after this list).
    3. Using the graph to cluster articles: Our solution for grouping similar articles together into news events is based in part on the similarity of Article/Actor subgraphs. There is a strong signal in the fact that two articles within a small time span refer to the same actors. Some state-of-the-art clustering algorithms are graph-based, and Neo4j allowed us to quickly approach an excellent clustering solution.
    4. Using Neo4j for named entity recognition: A central challenge for Backstory is recognizing actors in news article text. Until now, we have used a blend of open-source natural language processing tools and human intervention. But we’ve begun to experiment with using graphs to identify actors, and the results are a marked improvement and extremely promising.
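Here is what that queue pattern might look like in Cypher – a minimal sketch, where the insertedAt timestamp property (used for insertion order) and the batch size are our own illustrative assumptions:

// Grab the oldest unprocessed articles for the analysis pipeline
// and remove the queue label in the same transaction.
MATCH (a:Article:Unprocessed)
WITH a ORDER BY a.insertedAt ASC LIMIT 100
REMOVE a:Unprocessed
RETURN a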

Conclusion


As mentioned above, our goal with Backstory is to create better ways for people to consume news and understand the world. Part of this is having a world-class technology platform for collecting and analyzing news.

Neo4j’s vibrant community and the flexibility of the graph database are enabling us to achieve these goals.

Instead of thinking about our database simply as a place where bits are stored, we think of our data as alive and brimming with insights. The graph lets our data breathe, striking the right balance between structure and versatility. Meanwhile, Cypher queries continue to perform well as the model grows more complex.

The Neo4j-powered news graph is absolutely the centerpiece of our system, and we’re excited for what the future holds.

If you’d like to follow our progress, join the mailing list on http://backstory.io or give us a follow on Twitter at @backstoryio.


Ready to use Neo4j for your next app or project? Get everything you need to know about harnessing graphs in O’Reilly’s Graph Databases – click below to get your free copy.

The post How Backstory.io Uses Neo4j to Graph the News [Community Post] appeared first on Neo4j Graph Database.


Non-Text Discovery with ConceptNet as a Neo4j Database [Community Post]


[As community content, this post reflects the views and opinions of the particular author and does not necessarily reflect the official stance of Neo4j.]

The Problem of Discovery


Discovery, especially non-text discovery, is hard.

When looking for a cool T-shirt, for example, I might not know exactly what I want – only that I’m looking for a gift T-shirt that’s a little mathy and emphasizes my friend’s love of nature.

As a retailer, I might notice that geometric nature products are quite popular, and want to capitalize by marketing the more general “math/nature” theme to potential buyers who have demonstrated an affinity for mathy animal shirts as well as improving the browsing experience for new visitors to my site.

Many retail sites with user-generated content rely on user-generated tags to classify image-driven products. However, the quality and number of tags on each item vary widely and depend on the item’s creator and the administrators of the site to curate and sort into browsable categories.

On Threadless, for example, this awesome item has a rich amount of tags:
lim heng swee, ilovedoodle, cats, lol, funny, humor, food, foodies, food with faces, pets, meow, ice cream, desserts, awww, puns, punny, wordplay, v-necks, vnecks, tanks, tank tops, crew sweatshirts, Cute
In contrast, this beautiful item has only a handful:
jimena salas, jimenasalas, funded, birds, animals, geometric shapes, abstract, Patterns
Furthermore, although a human might easily be able to classify an image with the tags [ants, anthill, abstract, goofy] as probably belonging to the “funny animals” category, an automated system would have to know that ants are animals and that goofy is a synonym for funny.

Knowing this, how would a retail site quickly and cheaply implement intelligent categorization and tag curation? ConceptNet5 and (of course) Neo4j.


ConceptNet5


This article introduces the ConceptNet dataset and describes how to import the data into a Neo4j database.

To paraphrase the ConceptNet5 website, ConceptNet5 is a semantic network built from nodes representing words or short phrases of natural language (“terms” or “concepts”), and the relationships (“associations”) between them.

Armed with this information, a system can take human words as input and use them to better search for information, answer questions and understand user goals.

For example, take a look at toast in the ConceptNet5 web demo:

Learn How to Leverage of Non-Text Discovery by using the ConceptNet Dataset within Neo4j


This looks remarkably similar to a graph model. The dataset is incredibly rich, including (in the JSON) the “sense” of toast as a bread and also as a drink one has in tribute.

Let’s take a look at the JSON response for one ConceptNet edge (the association between two concepts) and import some data into a Neo4j database for exploration:

{
     edges: 
     [
          {
               context: "/ctx/all",
               dataset: "/d/globalmind",
               end: "/c/en/bread",
               features: 
               [
                    "/c/en/toast /r/IsA -",
                    "/c/en/toast - /c/en/bread",
                    "- /r/IsA /c/en/bread"
               ],
               id: "/e/ff9b268e050d62255f236f35ba104300551b8a3b",
               license: "/l/CC/By-SA",
               rel: "/r/IsA",
               source_uri:                                              
               "/or/[/and/[/s/activity/globalmind/assert/,/s/
               contributor/omcs/bugmenot/]/,/s/umbel/2013/]",
               sources: 
               [
                    "/s/activity/globalmind/assert",
                    "/s/contributor/omcs/bugmenot",
                    "/s/umbel/2013"
               ],
               start: "/c/en/toast",
               surfaceText: "Kinds of [[bread]] : [[toast]]",
               uri: "/a/[/r/IsA/,/c/en/toast/,/c/en/bread/]",
               weight: 3
          }
     ]
}

Modeling the Database


For the purposes of this example, let’s model the database with the following properties:

Term Nodes:
    • concept
    • language
    • partOfSpeech
    • sense
Association Relationships:
    • type
    • weight
    • surfaceText
An alternate model could use the assertion type as the relationship type instead of a property, but for the sake of this blog post, let’s keep types as properties. This allows us to explore the ConceptNet database without making assumptions about the kinds of relationships in the dataset.
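In Cypher, the contrast between the two options looks roughly like this (toast and bread are illustrative concepts borrowed from the web demo above):

// Model used in this post: the ConceptNet relation stored as a property.
MERGE (toast:Term {concept: "toast", language: "en"})
MERGE (bread:Term {concept: "bread", language: "en"})
MERGE (toast)-[:ASSERTION {type: "IsA", weight: 3}]->(bread)
// The alternate model would make the relation the relationship type
// itself, e.g. (toast)-[:IsA]->(bread)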

Loading the Data into the Database


Let’s use the following Python script to upload some sample data:

import requests
from py2neo import authenticate, Graph

USERNAME = "neo4j" # use your actual username
PASSWORD = "12345678" # use your actual password
authenticate("localhost:7474", USERNAME, PASSWORD)
graph = Graph()

sample_tags = ['fruit', 'orange', 'bikes', 'cream', 'nature', 'toast',
               'electronic', 'techno', 'house', 'dubstep', 'drum_and_bass',
               'space_rock', 'psychedelic_rock', 'psytrance', 'garage',
               'progressive', 'Cologne', 'North_Rhine-Westphalia',
               'gothic_rock', 'darkwave', 'goth', 'geometric', 'nature',
               'skylines', 'landscapes', 'mountains', 'trees', 'silhouettes',
               'back_in_stock', 'Patterns', 'raglans', 'giraffes', 'animals',
               'nature', 'tangled', 'funny', 'cute', 'krautrock']

# Build the import query: split the ConceptNet URIs into their parts,
# then MERGE the terms and the assertion between them.
query = """
WITH {json} AS document
UNWIND document.edges AS edges
WITH
SPLIT(edges.start,"/")[3] AS startConcept,
SPLIT(edges.start,"/")[2] AS startLanguage,
CASE WHEN SPLIT(edges.start,"/")[4] <> "" THEN SPLIT(edges.start,"/")[4] ELSE "" END AS startPartOfSpeech,
CASE WHEN SPLIT(edges.start,"/")[5] <> "" THEN SPLIT(edges.start,"/")[5] ELSE "" END AS startSense,
SPLIT(edges.rel,"/")[2] AS relType,
CASE WHEN edges.surfaceText <> "" THEN edges.surfaceText ELSE "" END AS surfaceText,
edges.weight AS weight,
SPLIT(edges.end,"/")[3] AS endConcept,
SPLIT(edges.end,"/")[2] AS endLanguage,
CASE WHEN SPLIT(edges.end,"/")[4] <> "" THEN SPLIT(edges.end,"/")[4] ELSE "" END AS endPartOfSpeech,
CASE WHEN SPLIT(edges.end,"/")[5] <> "" THEN SPLIT(edges.end,"/")[5] ELSE "" END AS endSense
MERGE (start:Term {concept:startConcept, language:startLanguage, partOfSpeech:startPartOfSpeech, sense:startSense})
MERGE (end:Term  {concept:endConcept, language:endLanguage, partOfSpeech:endPartOfSpeech, sense:endSense})
MERGE (start)-[r:ASSERTION {type:relType, weight:weight, surfaceText:surfaceText}]-(end)
"""

# Use the Search endpoint to load the data for each tag into the graph.
for tag in sample_tags:
    searchURL = "http://conceptnet5.media.mit.edu/data/5.4/c/en/" + tag + "?limit=500"
    searchJSON = requests.get(searchURL, headers={"accept": "application/json"}).json()
    graph.cypher.execute(query, json=searchJSON)

Exploring the Data


Use the following Cypher query to explore the data:

MATCH (n:Term {language:'en'})-[r:ASSERTION]->(m:Term {language:'en'})
WHERE 
NOT r.type = 'dbpedia' AND
NOT r.surfaceText = '' AND
NOT n.partOfSpeech = '' AND
NOT n.sense = ''
RETURN n.concept AS `Start Concept`, n.sense AS `in the sense of`, r.type, m.concept AS `End Concept`, m.sense AS `End Sense`
ORDER BY r.weight DESC, n.sense ASC
LIMIT 10

The ConceptNet dataset is incredibly rich, providing the various “senses” in which someone might mean “orange” as well as a wide variety of relationship types to choose from.

    | Start Concept | in the sense of                                         | r.type     | End Concept     | End Sense
----+---------------+---------------------------------------------------------+------------+-----------------+-----------
  1 | orange        | colour                                                  | IsA        | color           |
  2 | orange        | film                                                    | InstanceOf | film            |
  3 | dynamic       | a_characteristic_or_manner_of_an_interaction_a_behavior | Synonym    | nature          |
  4 | garage        | a_petrol_filling_station                                | Synonym    | petrol_station  |
  5 | garage        | a_petrol_filling_station                                | Synonym    | fill_station    |
  6 | garage        | a_petrol_filling_station                                | Synonym    | gas_station     |
  7 | progressive   | advancing_in_severity                                   | Antonym    | non_progressive |
  8 | shop          | automobile_mechanic's_workplace                         | Synonym    | garage          |
  9 | electronic    | band                                                    | IsA        | band            |
 10 | cream         | band                                                    | IsA        | band            |

Use Cases and Future Directions


When translated into a graph database, the ConceptNet5 API takes the agony out of tag-based recommendations and categorizations.

Small retail and social startups can integrate a Neo4j microservice into their currently existing stack, using it to power recommendations, provide insights on what is the most effective way to categorize products (should “funny cats” have their own first-level category, or should they go under “animals”?), and allow more time and budget for richer innovations.

References


Loading JSON into a Neo4j Database
Dealing with Empty Columns
Data


Learn how to build a real-time recommendation engine for non-text discovery on your website: Download this white paper – Powering Recommendations with a Graph Database – and start offering more timely, relevant suggestions to your users.

The post Non-Text Discovery with ConceptNet as a Neo4j Database [Community Post] appeared first on Neo4j Graph Database.

3 RDBMS & Graph Database Deployment Strategies (Polyglot & More)

Whether you’re ready to move your entire legacy RDBMS into a graph database, you’re syncing databases for polyglot persistence or you’re just conducting a brief proof of concept, at some point you’ll want to bring a graph database into your organization or architecture.

Once you’ve decided on your deployment strategy, you’ll then need to move some (or all) of your data from your relational database into a graph database. In this blog post, we’ll show you how to make that process as smooth and seamless as possible.

Your first step is to ensure you have a proper understanding of the native graph property model (i.e., nodes, relationships, labels, properties and relationship-types), particularly as it applies to your given domain.

In fact, you should at least complete a basic graph model on a whiteboard before you begin your data import. Knowing your data model ahead of time – and the deployment strategy in which you’ll use it – makes the import process significantly less painful.

In this RDBMS & Graphs blog series, we’ll explore how relational databases compare to their graph counterparts, including data models, query languages, deployment strategies and more. In previous weeks, we’ve explored why RDBMS aren’t always enough, graph basics for the RDBMS developer, relational vs. graph data modeling and SQL vs. Cypher as query languages.

This week, we’ll discuss three different database deployment strategies for relational and graph databases – as well as how to import your RDBMS data into a graph.

Three Database Deployment Strategies for Graphs and RDBMS


There are three main strategies to deploying a graph database relative to your RDBMS. Which strategy is best for your application or architecture depends on your particular goals.

Below, you can see each of the deployment strategies for both a relational and graph database:

Learn the Different Deployment Strategies for RDBMS & Graph Databases, Such as Polyglot Persistence

The three most common database deployment strategies for relational and graph databases.

First, some development teams decide to abandon their relational database altogether and migrate all of their data into a graph database. This is typically a one-time, bulk migration.

Second, other developers continue to use their relational database for any use case that relies on non-graph, tabular data. Then, for any use cases that involve a lot of JOINs or data relationships, they store that data in a graph database.

Third, some development teams duplicate all of their data into both a relational database and a graph database. That way, data can be queried in whatever form is the most optimal for the queries they’re trying to run.

The second and third strategies are considered polyglot persistence, since both approaches use a data store according to its strengths. While this introduces additional complexity into an application’s architecture, it often results in getting the most optimized results from the best database for the query.

None of these is the “correct” strategy for deploying an RDBMS and a graph. Your team should consider your application goals, frequent use cases and most common queries and choose the appropriate solution for your particular environment.

Extracting Your Data from an RDBMS


No matter your given strategy, if you decide you need to import your relational data into a graph database, the first step is to extract it from your existing RDBMS.

Almost all relational databases allow you to dump whole tables or whole datasets, as well as export the results of ad-hoc queries, to CSV. These exports are usually just a built-in copy function of the database itself. Of course, in many cases the CSV file ends up on the database server, so you have to download it from there, which can be a challenge.

Another option is to access your relational database with a database driver like JDBC to extract the datasets you want to pull out.

Also, if you want to set up a syncing mechanism between your relational and graph databases, then it makes sense to regularly pull the given data according to a timestamp or another updated flag so that data is synced into your graph.

Another facet to consider is that many relational databases aren’t designed or optimized for exporting large amounts of data within a short time period. So if you’re trying to migrate data directly from an RDBMS to a graph, the process might stall significantly.

For example, in one case a Neo4j customer had a large social network stored in a MySQL cluster. Exporting the data from the MySQL database took three days; importing it into Neo4j took just three hours.

One final tip before you begin: When you write to disk, be sure to disable virus scanners and check your disk scheduler so you get the highest disk performance possible. It’s also worth checking any other options that might increase performance during the import process.

Importing Data via LOAD CSV


The easiest way to import data from your relational database is to create a CSV dump of individual entity-tables and JOIN-tables. The CSV format is the lowest common denominator of data formats between a variety of different applications. While the CSV format itself is unpopular, it’s also the easiest to work with when it comes to importing data into a graph database.

In Neo4j specifically, LOAD CSV is a Cypher keyword that allows you to load CSV files from HTTP or file URLs into your database. Each row of data is made available to your Cypher statement and then from those rows, you can actually create or update nodes and relationships within your graph.

The LOAD CSV command is a powerful way of converting flat data (i.e. CSV files) into connected graph data. LOAD CSV works both with single-table CSV files as well as with files that contain a fully denormalized table or a JOIN of several tables.

LOAD CSV allows you to convert, filter or de-structure import data during the import process. You can also use this command to split areas, pull out a single value or iterate over a certain list of attributes and then filter them out as attributes.

Finally, with LOAD CSV you can control the size of transactions (using the USING PERIODIC COMMIT keyword) so you don’t run into memory issues, and you can run LOAD CSV via the Neo4j shell (and not just the Neo4j browser), which makes it easier to script your data imports.

In summary, you can use Cypher’s LOAD CSV command to:
    • Ingest data, accessing columns by header name or offset
    • Convert values from strings to different formats and structures (toFloat, split, ...)
    • Skip rows to be ignored
    • MATCH existing nodes based on attribute lookups
    • CREATE or MERGE nodes and relationships with labels and attributes from the row data
    • SET new labels and properties or REMOVE outdated ones
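As a quick sketch of several of those capabilities together – the file name and columns below are invented for illustration – a single statement can cast and split values while creating nodes:

LOAD CSV WITH HEADERS FROM 'file:///data/movies.csv' AS line
// Convert the rating to a number and split the pipe-separated genres.
WITH line, toFloat(line.rating) AS rating, split(line.genres, "|") AS genres
MERGE (m:Movie {title: line.title})
SET m.rating = rating, m.genres = genres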

A LOAD CSV Example


Here’s a brief example of importing a CSV file into Neo4j using the LOAD CSV Cypher command.

Example file: persons.csv

name;email;dept
"Lars Higgs";"lars@higgs.com";"IT-Department"
"Maura Wilson";"maura@wilson.com";"Procurement"

Cypher statement:

LOAD CSV WITH HEADERS FROM 'file:///data/persons.csv' AS line
FIELDTERMINATOR ";"
MERGE (person:Person {email: line.email}) ON CREATE SET person.name = line.name
WITH person, line
MATCH (dept:Department {name: line.dept})
CREATE (person)-[:EMPLOYEE]->(dept)

You can import multiple CSV files from one or more data sources (including your RDBMS) to enrich your core domain model with other information that might add interesting insights and capabilities.

Other, dedicated import tools help you import larger volumes (10M+ rows) of data efficiently, as described below.

The Command-Line Bulk Loader


The neo4j-import command is a scalable tool for bulk inserts. It takes CSV files and scales across all of your available CPUs and disk bandwidth, using a staged architecture in which each input step is parallelized where possible, along with advanced in-memory compression, to create the new graph structures.

The command-line bulk loader is lightning fast, able to import up to one million records per second and handle large datasets of several billion nodes, relationships and properties. Note that because of these performance optimizations the neo4j-import tool can only be used for initial database population.

Loading Data Using Cypher


For importing data into a graph database, you can also use the Neo4j REST API to run Cypher statements yourself. With this API, you can run Cypher statements that create, update and merge data.

The transactional Cypher HTTP endpoint is available to all drivers. You can also use the HTTP endpoint directly from an HTTP client or an HTTP library in your language.

Using the HTTP endpoint (or another API), you can pull the data out of your relational database (or other data source) and convert it into parameters for Cypher statements. Then you can batch and control import transactions from there.
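For example, a batched, parameterized statement might look like the following – a sketch only, where the rows parameter and the Person properties are placeholders you would fill from your RDBMS result set:

// rows is a parameter: a list of maps produced from your SQL result set.
UNWIND {rows} AS row
MERGE (p:Person {email: row.email})
SET p.name = row.name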

From Neo4j 2.2 onwards, Cypher also works really well with highly concurrent writes. In one test, one million nodes and relationships per second were inserted with highly concurrent Cypher statements using this method.

The Cypher-based loading method works with a number of different drivers, including the JDBC driver. If you have an ETL tool or Java program that already uses a JDBC tool, you can use Neo4j’s JDBC driver to import data into Neo4j because Cypher statements are just query strings (more on the JDBC driver next week). In this scenario, you can provide parameters to your Cypher statements as well.

Other RDBMS-to-Graph Import Resources:


This blog post has only covered the three most common methods for importing (or syncing) data in a graph database from a relational store.

The following are further resources on additional methods for data import, as well as more in-depth guides on the three methods discussed above:
Next week we’ll take a look at connecting to a graph database via drivers and other integrations.


Want to learn more on how relational databases compare to their graph counterparts? Download this ebook, The Definitive Guide to Graph Databases for the RDBMS Developer, and discover when and how to use graphs in conjunction with your relational database.



Catch up with the rest of the RDBMS & Graphs series:

The post 3 RDBMS & Graph Database Deployment Strategies (Polyglot & More) appeared first on Neo4j Graph Database.

Analyzing the Panama Papers with Neo4j: Data Models, Queries & More

As the world has seen, the International Consortium of Investigative Journalists (ICIJ) has exposed highly connected networks of offshore tax structures used by the world’s richest elites.

These structures were uncovered from leaked financial documents and were analyzed by the journalists. They extracted the metadata of documents using Apache Solr and Tika, then connected all the information together using the leaked databases, creating a graph of nodes and edges in Neo4j and made it accessible using Linkurious’ visualization application.

In this post, we look at the graph data model used by the ICIJ and show how to construct it using Cypher in Neo4j. We dissect an example from the leaked data, recreating it using Cypher, and show how the model could be extended.

Discover How the Panama Papers Can be Analyzed Using Neo4j with Example Data Models, Queries & More


The Steps Involved in the Document Analysis


  1. Acquire documents
  2. Classify documents
    1. Scan / OCR
    2. Extract document metadata
  3. Whiteboard domain
    1. Determine entities and their relationships
    2. Determine potential entity and relationship properties
    3. Determine sources for those entities and their properties
  4. Work out analyzers, rules, parsers and named entity recognition for documents
  5. Parse and store document metadata and document and entity relationships
    1. Parse by author, named entities, dates, sources and classification
  6. Infer entity relationships
  7. Compute similarities, transitive cover and triangles
  8. Analyze data using graph queries and visualizations
A Data Model of Implied Company Connections


Finding triads in the graph can reveal inferred connections. Here, Bob has an inferred connection to CompanyB through CompanyA.
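Such triads are easy to surface in Cypher. Here is a minimal sketch using the uppercase relationship naming introduced later in this post; since officers can themselves be companies, a company can appear as an officer of another company:

// Find officers connected to a second company through a first one.
MATCH (p:Officer)-[:IS_OFFICER_OF]->(c1:Company),
      (c1)-[:IS_OFFICER_OF]->(c2:Company)
RETURN p, c1, c2
LIMIT 25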

From Documents to Graph


A simple model of the organizational domain of business inter-relationships in a holding looks much like the model used in business registries, a common use case for Neo4j. As a minimum you have:
    • Clients
    • Companies
    • Addresses
    • Officers (both natural people and companies)
With these relationships:
    • (:Officer)-[:is officer of]->(:Company)
      • With these classifications:
        • protector
        • beneficiary, shareholder, director
        • beneficiary
        • shareholder
    • (:Officer)-[:registered address]->(:Address)
    • (:Client)-[:registered]->(:Company)
    • (:Officer)-[:has similar name and address]->(:Officer)
All these entities have a lot of properties, like document numbers, share amounts, start- and end-dates of involvements, addresses, citizenship and much more. Two entities of the same name can have very different amounts of information attached to them, though this depends on the relevant information that was extracted from the sources, e.g., some officers have only a name, others have a full record with more than 15 attributes.

Those have specific relationships like a person is the “officer of” a company. This is a basic domain that you can populate from documents about a tax haven shell company holding, a.k.a. the #PanamaPapers.

Initially, you classify the raw documents by types and subtypes (like contract or invitation). Then you attach as much direct and indirect metadata as you can. Direct metadata comes from the document types themselves (like the senders and receivers of an email or the parties of a contract). Inferred metadata is gained from the content of the documents, using techniques like natural language processing, named entity recognition or plain text search for well-known terms like distinctive names or roles.

The first step to build your graph model is to extract those named entities from the documents and their metadata. This includes companies, persons and addresses. These entities become nodes in the graph. For example, from a company registration document, we can extract the company and officer entities.

Some relationships can be directly inferred from the documents. In the previous example, we would model the officer as directly connected to the company:

(:Officer)-[:IS_OFFICER_OF]->(:Company)

Other relationships can be inferred by analyzing email records. If we see several emails between a person and a company we can infer that the person is a client of that company:

(:Client)-[:IS_CLIENT_OF]->(:Company)

We can use similar logic to create relationships between entities that share the same address, have family ties or business relationships or that regularly communicate.
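For instance, a shared registered address can be materialized as an explicit relationship between officers. This is a sketch only; the SHARES_ADDRESS_WITH name and the uppercase REGISTERED_ADDRESS spelling are our assumptions:

// Connect pairs of officers registered at the same address.
MATCH (o1:Officer)-[:REGISTERED_ADDRESS]->(addr:Address),
      (o2:Officer)-[:REGISTERED_ADDRESS]->(addr)
WHERE id(o1) < id(o2)
MERGE (o1)-[:SHARES_ADDRESS_WITH]->(o2)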

    • Direct metadata -> entities -> relationships to documents
      • author, receivers, account-holder, attached to, mentioned, co-located
      • Turn plain entities / names into full records using registries and profile documents
    • Inferred metadata and information from other sources -> Relationships between entities
      • Related to people or organizations from the direct metadata
      • Same addresses / organizations
      • Find peer groups / rings within fraudulent activities
      • Family ties, business relationships
      • Part of the communication chain
The Graph Data Model Used by the ICIJ to Analyze the Panama Papers

The graph data model used by the ICIJ

Issues with the ICIJ Data Model


There are some modeling and data quality issues with the ICIJ data model.

The ICIJ data contains a lot of duplicates, only a few of which are connected by a “has similar name or address” relationship; most of the others can be inferred from the first and last parts of a name together with addresses and family ties. It would also be beneficial for the data model to actually merge those duplicates, so that certain duplicate relationships could be merged as well.

In the ICIJ data model, shareholder information like number of shares, issue dates, etc. is stored with the “Officer” where the officer can be shareholder in any number of Companies. It would be better to store that shareholder information on the “is officer of – Shareholder” relationship.

Some of the Boolean properties could be represented as labels, e.g., “citizenship=yes” could become a Person label.

How Could You Extend the Basic Graph Model Used by the ICIJ?


The domain model used by the ICIJ is really basic, just containing four types of entities (Officer, Client, Company, Address) and four relationships between them. It is more or less a static view on the organizational relationships but doesn’t include interactions or activities. Looking at the source documents and the other activities outlined in the report, there are many more things which can enrich this graph model to make it more expressive.

We can model the original documents, their metadata and their relationships to people. Some of those relationships are inferred: from being part of conversations, or from being mentioned in or the subject of documents. Other interesting relationships are the aliases and interpretations of entities that were used during the analysis, which allow other journalists to reproduce the original thought processes.

Also, sources of additional information, like business registries, watch-lists, census records or other journalistic databases, can be added. Human relationships like family or business ties can be created explicitly, as can implicit relationships indicating that the actors are part of the same fraudulent group or ring.

Other missing aspects are the activities and the money flow. Examples of activities are the opening and closing of accounts, the creation or merger of companies, filing records for those companies or assigning responsibilities. For the money flow, we could track the banks, accounts and intermediaries used in the monetary transactions mentioned, so you can get an overview of the amounts transferred and the patterns of transfers; a minimal sketch follows the list below. Those patterns can then be applied to extract additional fraudulent money flows from other transaction systems.

Graph data is very flexible and malleable, as soon as you have a single connection point, you can integrate new sources of data and start finding additional patterns and relationships that you couldn’t trace before.

    • New Entities:
      • Documents: E-Mail, PDF, Contract, DB-Record, …
      • Money Flow: Accounts / Banks / Intermediaries
    • New Relationships
      • Family / business ties
      • Conversations
      • Peer Groups / Rings
      • Similar Roles
      • Mentions / Topic-Of
      • Money Flow
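
A minimal sketch of the money-flow extension mentioned above (all labels, relationship types and property names are assumptions, not part of the ICIJ model):

(:Person)-[:HOLDS]->(:Account)-[:TRANSFERRED {amount: 1000000, date: "2011-03-15"}]->(:Account)-[:HELD_AT]->(:Bank)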

Let’s Look at a Concrete Example


Let’s look at the family of Azerbaijan’s President Ilham Aliyev, which was already the topic of a GraphGist by Linkurious. We see his wife, two daughters and son depicted in the graphic below.

The Azerbaijan President's Fraud Ring Analyzed by Linkurious


Quoting the ICIJ “The Power Players” Publication (emphasis for names added):

The family of Azerbaijan President Ilham Aliyev leads a charmed, glamorous life, thanks in part to financial interests in almost every sector of the economy. His wife, Mehriban, comes from the privileged and powerful Pashayev family that owns banks, insurance and construction companies, a television station and a line of cosmetics. She has led the Heydar Aliyev Foundation, Azerbaijan’s pre-eminent charity behind the construction of schools, hospitals and the country’s major sports complex. Their eldest daughter, Leyla, editor of Baku magazine, and her sister, Arzu, have financial stakes in a firm that won rights to mine for gold in the western village of Chovdar and Azerfon, the country’s largest mobile phone business. Arzu is also a significant shareholder in SW Holding, which controls nearly every operation related to Azerbaijan Airlines (“Azal”), from meals to airport taxis. Both sisters and brother Heydar own property in Dubai valued at roughly $75 million in 2010; Heydar is the legal owner of nine luxury mansions in Dubai purchased for some $44 million.
We took the data from the ICIJ visualization and converted the 2D graph visualization into graph patterns in the Cypher query language. If you squint, you can still see the same structure as in the visualization. We only compressed “is officer of – Beneficiary, Shareholder, Director” to IOO_BSD and prefixed the other “is officer of” relationships with IOO.

We didn’t add shares, citizenship, reg-numbers or addresses that were properties of the entities or relationships. You can see them when clicking on the elements of the embedded original visualization.



Cypher Statement to Set Up the Visualized Entities and Relationships


CREATE
(leyla:Officer {name:"Leyla Aliyeva"})-[:IOO_BSD]->(ufu:Company {name:"UF Universe Foundation"}),
(mehriban:Officer {name:"Mehriban Aliyeva"})-[:IOO_PROTECTOR]->(ufu),
(arzu:Officer {name:"Arzu Aliyeva"})-[:IOO_BSD]->(ufu),
(mossack_uk:Client {name:"Mossack Fonseca & Co (UK)"})-[:REGISTERED]->(ufu),
(mossack_uk)-[:REGISTERED]->(fm_mgmt:Company {name:"FM Management Holding Group S.A."}),

(leyla)-[:IOO_BSD]->(kingsview:Company {name:"Kingsview Developents Limited"}),
(leyla2:Officer {name:"Leyla Ilham Qizi Aliyeva"}),
(leyla3:Officer {name:"LEYLA ILHAM QIZI ALIYEVA"})-[:HAS_SIMILAR_NAME]->(leyla),
(leyla2)-[:HAS_SIMILAR_NAME]->(leyla3),
(leyla2)-[:IOO_BENEFICIARY]->(exaltation:Company {name:"Exaltation Limited"}),
(leyla3)-[:IOO_SHAREHOLDER]->(exaltation),
(arzu2:Officer {name:"Arzu Ilham Qizi Aliyeva"})-[:IOO_BENEFICIARY]->(exaltation),
(arzu2)-[:HAS_SIMILAR_NAME]->(arzu),
(arzu2)-[:HAS_SIMILAR_NAME]->(arzu3:Officer {name:"ARZU ILHAM QIZI ALIYEVA"}),
(arzu3)-[:IOO_SHAREHOLDER]->(exaltation),
(arzu)-[:IOO_BSD]->(exaltation),
(leyla)-[:IOO_BSD]->(exaltation),
(arzu)-[:IOO_BSD]->(kingsview),

(redgold:Company {name:"Redgold Estates Ltd"}),
(:Officer {name:"WILLY & MEYRS S.A."})-[:IOO_SHAREHOLDER]->(redgold),
(:Officer {name:"LONDEX RESOURCES S.A."})-[:IOO_SHAREHOLDER]->(redgold),
(:Officer {name:"FAGATE MINING CORPORATION"})-[:IOO_SHAREHOLDER]->(redgold),
(:Officer {name:"GLOBEX INTERNATIONAL LLP"})-[:IOO_SHAREHOLDER]->(redgold),
(:Client {name:"Associated Trustees"})-[:REGISTERED]->(redgold)

Linked Entities in the Panama Papers Data Visualized in Neo4j


Interesting Queries


Family Ties via Last Name:

MATCH (o:Officer) 
WHERE toLower(o.name) CONTAINS "aliyev"
RETURN o

Family Ties by Last Name in the Azerbaijan Data


Family Involvements:

MATCH (o:Officer) WHERE toLower(o.name) CONTAINS "aliyev"
MATCH (o)-[r]-(c:Company)
RETURN o,r,c

Cypher Example for Family Involvements in the Azerbaijan Data


Who Are the Officers of a Company and Their Roles:

MATCH (c:Company)-[r]-(o:Officer) WHERE c.name = "Exaltation Limited"
RETURN *

Company Officers and Roles in the Azerbaijan Data


Show Joint Company Involvements of Family Members

MATCH (o1:Officer)-[r1]->(c:Company)<-[r2]-(o2:Officer)
WITH o1.name AS first, o2.name AS second, count(*) AS freq,
     collect({name: c.name, kind1: type(r1), kind2: type(r2)}) AS involvements
WHERE freq > 1 AND first < second
RETURN first, second, involvements, freq

Joint Company Involvement of Family Members in the Azerbaijan Data


Resolve Duplicate Entities

MATCH (o:Officer)
RETURN toLower(split(o.name, " ")[0]) AS first_name, collect(o.name) AS names, count(*) AS count

Resolving Duplicate Entities in the Azerbaijan Data


Resolve Duplicate Entities by First and Last Part of the Name

MATCH (o:Officer)
WITH split(toLower(o.name), " ") AS name_parts, o
WITH name_parts[0] + " " + name_parts[-1] AS name, collect(o.name) AS names, count(*) AS count
WHERE count > 1
RETURN name, names, count
ORDER BY count DESC

Resolve Duplicate Data Entities by First and Last Part of Name


Transitive Path from Mossack to the Officers in that Example

MATCH path=(:Client {name: "Mossack Fonseca & Co (UK)"})-[*]-(o:Officer)
WHERE none(r IN relationships(path) WHERE type(r) = "HAS_SIMILAR_NAME")
RETURN [n IN nodes(path) | n.name] AS hops, length(path)

The Transitive Path between Mossack Fonseca and Company Officers in the Panama Papers Data


Shortest Path between Two People

MATCH (a:Officer {name: "Mehriban Aliyeva"})
MATCH (b:Officer {name: "Arzu Aliyeva"}) 
MATCH p=shortestPath((a)-[*]-(b))
RETURN p

Finding a Shortest Path in Neo4j in the Azerbaijan Data


Further Work – Extension of the Model


Merge Duplicates

We create a Person node and connect all matching Officer nodes to that single person, reusing our statement from the duplicate detection above.

MATCH (o:Officer)
WITH split(toLower(o.name), " ") AS name_parts, o
WITH name_parts[0] + " " + name_parts[-1] AS name, collect(o) AS officers
// originally, natural persons have a "citizenship" property
WHERE name CONTAINS "aliyev"
CREATE (p:Person {name: name})
FOREACH (o IN officers | CREATE (o)-[:IDENTITY]->(p))

Introduce Family Ties between Those People


CREATE (ilham:Person {name:"ilham aliyev"})
CREATE (heydar:Person {name:"heydar aliyev"})
WITH ilham, heydar
MATCH (mehriban:Person {name:"mehriban aliyeva"})

MATCH (leyla:Person {name:"leyla aliyeva"})
MATCH (arzu:Person {name:"arzu aliyeva"})

FOREACH (child IN [leyla, arzu, heydar] | CREATE (child)-[:CHILD_OF]->(ilham) CREATE (child)-[:CHILD_OF]->(mehriban))
CREATE (leyla)-[:SIBLING_OF]->(arzu)
CREATE (leyla)-[:SIBLING_OF]->(heydar)
CREATE (arzu)-[:SIBLING_OF]->(heydar)
CREATE (ilham)-[:MARRIED_TO]->(mehriban)

Show the Family

MATCH (p:Person) RETURN p

The Aliyev Family in the Azerbaijan Data


Family Ties to Companies

MATCH (p:Person) WHERE p.name CONTAINS "aliyev"
OPTIONAL MATCH (c:Company)<--(o:Officer)-[:IDENTITY]-(p) 
RETURN c,o,p

Family Ties to Companies in the Azerbaijan Data


GraphGist


You can explore the example dataset yourself in this interactive graph model document (called a GraphGist). You can find many more for various use-cases and industries on our GraphGist portal.


Want to start your own project like this using Neo4j? Click below to get your free copy of O’Reilly’s Graph Databases ebook and get started with graph databases today.

The post Analyzing the Panama Papers with Neo4j: Data Models, Queries & More appeared first on Neo4j Graph Database.

The 5-Minute Interview: Tom Zeppenfeldt, Founder of Graphileon

Catch this week’s 5-Minute Interview with Tom Zeppenfeldt, Director and Founder at Graphileon

For this week’s 5-Minute Interview, I chatted with Tom Zeppenfeldt, Director and Founder at Graphileon in the Netherlands. Tom and I chatted this past summer about what’s new at Graphileon.

Here’s what we covered:

Tell us a bit about yourself and about how you use Neo4j.


Tom Zeppenfeldt: I’m the Founder and Owner of Graphileon. We became a Neo4j solutions partner last April, but I already had a lot of experience working with Neo4j.

The project that we worked on before becoming a Neo4j partner was to create a platform for investigative journalists — the type of reporters who work on stories like the Panama Papers. And our main product at Graphileon is what we call the InterActor, which is a heavily enhanced user interface that communicates with Neo4j.

Can you share a bit more technical details about how that product works?


Tom: Of course. With the journalism project we’re working on, we ran into some limitations because we aren’t what I would call “hardcore ITers.” And we were looking for a user interface that people like us — who had been using Excel — could easily use. We needed a tool that would allow us to create and browse networks; create new nodes, tables and charts; and work with all different kinds of graph data.

Although we were working on the journalism project, we realized that if we made the tool generic, everyone who uses Neo4j could have this useful add-on. It’s always good to have some tool at your side that allows you to browse and do discovery and exploration in your Neo4j store in order to build prototype applications or applications that only have a short lifetime.

What made Neo4j stand out when you were exploring different technology solutions?


Tom: One of the main draws was Cypher; that was crucial. As I mentioned, we are not hardcore IT people, but Cypher — in terms of all the ASCII art-like pattern matching it allows you to do — was really easy to use. We’ve become more advanced and now consider ourselves power users of Neo4j.

The database is very easy to work with. You don’t have to go through a lot of technical studying to be able to create good data models or write your queries. But a user interface was still lacking.

Let’s say if you compare it to the standard user interface that comes with Neo4j — the Neo4j Browser — we have multiple panels. We can copy nodes from one panel to another, and we can also access different graph stores at the same time. We have shortcuts and even dynamic Cypher, which is very interesting.

For instance, imagine that you want to select a number of nodes that are linked in a node set. From that, you can automatically derive a kind of pattern and then send that to the database to give what we call isomorphs, or similar structures. This allows you to query and return all the places in your data where you have the same structure on the same path.

What have been some of the most surprising or interesting results you’ve seen using Neo4j?


Tom: The moment we started playing with dynamic Cypher was very interesting, especially once we found the correct division between software tiers such as the database and front-end tiers.

We started working with Cypher results as result maps that look and smell like a node, so it’s treated as a node by the database. That allowed us to make nice visualizations of aggregations of soft nodes, virtual nodes and virtual relationships. The fact that you can merge them into a result — even if you are combining your Cypher query data from different node types or nodes with different labels — makes it very easy to work with.

If you could take everything you know about Neo4j now and go back to the beginning, is there anything you would do differently?


Tom: Once we knew Cypher really well, we saw pitfalls in some of our models. My advice would be to try and limit the scope of your search with your Cypher statements as early as possible. In the first month or two, we struggled because we didn’t understand Cypher completely, which led to some mistakes. But if you can optimize those queries you can achieve huge improvements in performance.

For example, if you are doing traversals, opening a node to see what is inside is time-consuming. Sometimes it’s better to model a value as a node instead of a property, because then you can traverse relationships between those value nodes instead of comparing raw property values, which lets you search for ranges via those relationships.

As with any database, whether it’s a relational database, a document database or a graph database, you always have to consider the types of queries you want to perform. You can’t just build the model without knowing what kinds of questions you want to ask.

Anything else you’d like to add? Any closing thoughts?


Tom: It’s very interesting for us to see how quickly Neo4j develops. We started with version 1.0 and the difference between that version and what we have now is huge.

Since we build a lot of prototype applications, we are really pleased with the new functions every time they’re added. For instance, at a certain stage you added the keys() and properties() functions, which made developing a lot faster for us. Of course, we are also interested in what openCypher will bring, because as more people start to use Cypher, this will push further development of the language.

Want to share about your Neo4j project in a future 5-Minute Interview? Drop us a line at content@neo4j.com.


Using graph databases for journalism or investigation?
Read this white paper The Power of Graph-Based Search, and learn to leverage graph database technology for more insight and relevant database queries.


Discover Graph-Based Search

The post The 5-Minute Interview: Tom Zeppenfeldt, Founder of Graphileon appeared first on Neo4j Graph Database.

Detecting Fake News with Neo4j & KeyLines

Fake news is one of the more troubling trends of 2017. The term is liberally applied to discredit everything, from stories with perceived bias through to ‘alternative facts’ and downright lies. It has a warping effect on public opinion and spreads misinformation.

Fake news is nothing new – bad journalism and propaganda have always existed – but what is new is its ability to spread through social media.

In this post, we’ll see how graph analysis and visualization techniques can help social networking sites stop the spread of fake news. We’ll see how, like fraud, fake news detection is about understanding networks. We’ll discuss how Neo4j and the KeyLines graph visualization toolkit can power a comprehensive fake news detection process.

A quick note: For simplicity, in this post we’ll limit the term ‘fake news’ to describe completely fictitious and unsubstantiated articles (see examples like PizzaGate and the Ohio lost votes story).

How Is Fake News a Graph Problem?


To detect fake news, it’s essential to understand how it spreads online – between accounts, posts, pages, timestamps, IP addresses, websites, etc. Once we model these connections as a graph, we can differentiate between normal behaviors and abnormal activity where fake content could be shared.

Let’s get started.

Building Our Graph Data Model


When we’re detecting fraud, we usually rely on verifiable, watch-list friendly, demographic data like real names, addresses or credit card details. With fake news, we don’t have this luxury, but we do have data on social networking sites. This can give us useful information, including:
    • Accounts (or Pages)
      • Age, history, connections to other accounts, pages and groups
    • Posts
      • Device fingerprint / IP address, timestamp, number of shares, comments, reactions and ‘Reported’ flags
    • Articles
      • Host URL, WHOIS data, title, content and author
There are many ways to model this data as a graph. We usually start by mapping the main items to nodes: account, post and article. We know that IP addresses are important, so we can add those as nodes too. Everything else is added as a property:

Graph model for fake news detection

Our fake news detection graph model
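
As a minimal Cypher sketch of this model (the relationship types and property names are our assumptions; the Account–IP–Post–Article chain mirrors the diagram):

CREATE (a:Account {name: "example_user", friends: 12})
CREATE (ip:IP {address: "203.0.113.7"})
CREATE (p:Post {timestamp: 1486987800000, reported: false})
CREATE (ar:Article {url: "www.example.com", title: "Example headline"})
CREATE (a)-[:USES]->(ip), (p)-[:SENT_FROM]->(ip), (p)-[:SHARES]->(ar)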

Detection vs. Investigation


Fake news spreaders are just as determined as regular fraudsters. They’ll adapt their behavior to avoid detection, and employ bots to run brute-force attacks.

Relying on algorithmic or manual detection isn’t enough. We need a hybrid approach that combines automated detection and manual investigation:

The process model for fake news detection between Neo4j and KeyLines

A simplified model showing fake news detection powered by Neo4j and KeyLines

    • Automated detection
      • This uses a Neo4j graph database as the engine for an automated, rule-based detection process. It isolates posts and accounts that match patterns of behavior previously associated with fake news (‘known fraud’).
    • Manual investigation
      • At the same time, a manual investigation process, powered by a KeyLines graph visualization tool, helps uncover new behaviors (‘unknown fraud’).
New behaviors are fed back into the automated process, so automated detection rules can adapt and become more sophisticated.

Detecting Fake News with Neo4j


Once we’ve created our data store, we can run complex queries to detect high-risk content and accounts.

Here’s where graph databases like Neo4j offer huge advantages over traditional SQL or relational databases. Queries that could take hours now take seconds and can be expressed using intuitive and clean Cypher queries.

For example, we know that fake news botnets tend to share content in short bursts, using recently registered accounts with few connections. We can run a Cypher query that:
    • Returns all accounts:
      • that have fewer than 20 friend connections
      • that shared a link to www.example.com
      • between 12:07pm and 12:37pm on 13 February 2017
In Cypher, we’d simply express this as:

MATCH (account:Account)--(ip:IP)--(post:Post)--(article:Article)
WHERE account.friends < 20 AND article.url = 'www.example.com'
  AND post.timestamp > 1486987620000
  AND post.timestamp < 1486989420000
RETURN account
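
To keep a query like this fast at scale, you would typically index the properties used in the WHERE clause; a sketch using the Neo4j 3.x index syntax of the time:

CREATE INDEX ON :Account(friends);
CREATE INDEX ON :Article(url);
CREATE INDEX ON :Post(timestamp)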

Investigating Fake News with KeyLines


To seek out ‘unknown fraud’ – cases that follow patterns that can’t be fully defined yet – our manual investigation process looks for anomalous connections.

Learn how to use KeyLines and Neo4j to detect fake news on social media through graph visualization

Visual investigation tools provide an intuitive way to uncover unusual connections that could indicate fake content

A graph data visualization tool like KeyLines is essential for this.

Building a Visual Graph Model


Let’s define a visual graph model so we can start to load data from Neo4j into KeyLines.

It’s not a great idea to load every node and link in our source data. Instead, we should focus on the minimum viable elements that tell the story, and then add other properties as tooltips or nodes later.

We want to see our four key node types, with glyphs to highlight higher-risk data points like:
    • New accounts
    • Posts that have been reported by users
    • URLs that have previously been associated with fake content
This gives us a visual model that looks like this:

A visual data model of high-risk data points using glyphs in KeyLines


Loading the Data


To find anomalies, we need to define normal behavior. Graph visualization is the simplest way to do this.

Here’s what happens when we load 100 Post IDs into KeyLines:

Data loading of Facebook post IDs into KeyLines

Loading the metadata of 100 Facebook posts into KeyLines to identify anomalous patterns

Our synthesized dataset is simplified, with a lower rate of sharing activity and more anomalies than real-world data. But even in this example we can see both normal and unusual behavior:

Normal social media user news sharing behavior

Normal user sharing behavior, visualized as a graph

Normal posts look similar to our data model – featuring an account, IP, post and article. Popular posts may be attached to many accounts, each with their own IP, but generally this linear graph with no red glyphs indicates a low-risk post.

Other structures in the graph stand out as unusual. Let’s take a look at some examples.

1. Monitoring New Users


New users should always be treated as higher risk than established users. Without a known history, it’s difficult to understand a user’s intentions. Using the New User glyph, we can easily pick them out:

Non-suspicious user behavior social media post

A non-suspicious post being shared by a new user

A pattern of unusual user sharing behavior that might be fake news

This structure is much more suspicious, with a new user sharing flagged posts to articles on known fake news domains

2. Identifying Unusual Sharing Behavior


We can also use a graph view to uncover suspicious user behavior. Here’s one strange structure:

A deviant pattern of user sharing behavior that might be fake news

An anomalous structure for investigation

We can see one article has been shared multiple times, seemingly by three accounts with the same IP address. By expanding both the IP and article nodes, we can get a full view of the accounts associated with the link farm.

3. Finding New Fake News Domains


In addition to monitoring users, social networks should monitor links to domains known for sharing fake news. We’ve represented this in our visual model using red glyphs on the domain node. We’ve also used the combine feature to merge articles on the same domain:

Combining domain tracking using graph visualization

Using combos to see patterns in article domains

This view shows the websites being shared, rather than just the individual articles. We can pick out suspicious domains:

Suspicious fake news website domain sharing pattern


Try It for Yourself


This post is just an illustration of how you can use graph visualization techniques to understand the complex connected data associated with social media. We’ve used simplified data and examples to show how graph analysis could become part of the crackdown on fake news.

We’d love to see how this approach works using real-world data. Catch my lightning talk or stop by our table at GraphConnect on 11th May to see how we could work together!

References


The dataset we used here was synthesized from two sources.

Cambridge Intelligence is a Silver sponsor of GraphConnect Europe. Use discount code CAMBRIDGE30 to get 30% off your tickets and trainings.


Join us at Europe's premier graph technology event: get your ticket to GraphConnect Europe and we'll see you on 11th May 2017 at the QEII Centre in central London!

Sign Me Up

The post Detecting Fake News with Neo4j & KeyLines appeared first on Neo4j Graph Database.
