The purpose of this post is to recap the most important points from Lecture 5 of the recent Big Data in 30 hours course.
What is a graph?
- Vertices – Vertices denote discrete objects, such as a person, a place, or an event.
- Edges – Edges denote relationships between vertices. For example, a person might know another person, be involved in an event, or have recently been at a location.
- Properties – Properties express information about the vertices and edges. For example, a vertex might have a name and an age, and an edge might have a timestamp and/or a weight. More formally, this model is known as a property graph. Azure Cosmos DB supports the property graph model.
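To make the three concepts concrete, here is a minimal in-memory sketch of a property graph in Python. It is only an illustration of the model, not how Cosmos DB or any other database stores data, and the class names, method names and sample data are all made up for this example.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Vertex:
    id: str
    label: str                      # e.g. "person", "place", "event"
    properties: dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    source: str                     # id of the vertex the edge starts at
    target: str                     # id of the vertex the edge points to
    label: str                      # e.g. "knows", "attended"
    properties: dict[str, Any] = field(default_factory=dict)


@dataclass
class PropertyGraph:
    vertices: dict[str, Vertex] = field(default_factory=dict)
    edges: list[Edge] = field(default_factory=list)

    def add_vertex(self, vertex_id: str, label: str, **properties: Any) -> None:
        self.vertices[vertex_id] = Vertex(vertex_id, label, properties)

    def add_edge(self, source: str, target: str, label: str, **properties: Any) -> None:
        # An edge only makes sense between vertices that already exist.
        if source not in self.vertices or target not in self.vertices:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append(Edge(source, target, label, properties))

    def neighbors(self, vertex_id: str, label: str) -> list[Vertex]:
        # Follow outgoing edges with the given label.
        return [
            self.vertices[e.target]
            for e in self.edges
            if e.source == vertex_id and e.label == label
        ]


graph = PropertyGraph()
graph.add_vertex("alice", "person", name="Alice", age=34)
graph.add_vertex("bob", "person", name="Bob", age=41)
graph.add_vertex("meetup", "event", name="Graph meetup")
graph.add_edge("alice", "bob", "knows", weight=0.8)
graph.add_edge("alice", "meetup", "attended", timestamp="2019-03-01T18:00:00Z")

print([v.properties["name"] for v in graph.neighbors("alice", "knows")])
```

Vertices and edges both carry a label and a free-form dictionary of properties, which is all the property graph model requires.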
What is a graph database?
- A type of NoSQL datastore
- Stores data based on graph concepts (vertices, edges, properties)
Different types exist:
- Operational Graph Databases
- Knowledge Graph / RDF
- Multi-Modal Graphs
- Analytic Graphs
- Real-Time Native Parallel Graph
Common use cases for graph databases
- Fraud Detection
- Knowledge Graph
- Network and data center monitoring
- Recommendation Engines & Product Recommendation Systems
- Master Data Management – Customer 360
- Social Media & Social Networks
- Identity and Access management
- AI & Machine Learning
Apart from the cases above, you might generally benefit from using a graph database when dealing with highly connected data. If your data model looks more and more like a network of some sort, you might consider using a graph database.
Another common indicator that a graph database might help is when your SQL queries start to include multiple joins over many tables to get data for a single page or query result. An example might be a movie website whose main purpose is to display data for specific movies, TV shows and the people connected to them. One movie can have multiple actors who might play different roles, as well as many script writers, directors and so on. If you want to generate a page for a single movie, you have to join many tables at once, and this will get slower and slower as your data grows unless you decide to heavily index your data or use some sort of caching mechanism.
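The movie page example can be sketched with Python's built-in sqlite3 module. The schema, table names and sample rows below are hypothetical and chosen only for illustration. The point is how many joins a single page needs even in a tiny relational model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one table per entity and one per relationship type.
conn.executescript("""
CREATE TABLE movie    (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE person   (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE acted_in (movie_id INTEGER, person_id INTEGER, role TEXT);
CREATE TABLE directed (movie_id INTEGER, person_id INTEGER);
CREATE TABLE wrote    (movie_id INTEGER, person_id INTEGER);

INSERT INTO movie    VALUES (1, 'Example Movie');
INSERT INTO person   VALUES (1, 'Actor One'), (2, 'Actor Two'),
                            (3, 'Director One'), (4, 'Writer One');
INSERT INTO acted_in VALUES (1, 1, 'Hero'), (1, 2, 'Villain');
INSERT INTO directed VALUES (1, 3);
INSERT INTO wrote    VALUES (1, 4);
""")

# One movie page = six joins across five tables, and the number grows
# with every new relationship type (producers, composers, studios, ...).
page_query = """
SELECT 'actor' AS credit, p.name, a.role
  FROM movie m
  JOIN acted_in a ON a.movie_id = m.id
  JOIN person p   ON p.id = a.person_id
 WHERE m.id = ?
UNION ALL
SELECT 'director', p.name, NULL
  FROM movie m
  JOIN directed d ON d.movie_id = m.id
  JOIN person p   ON p.id = d.person_id
 WHERE m.id = ?
UNION ALL
SELECT 'writer', p.name, NULL
  FROM movie m
  JOIN wrote w  ON w.movie_id = m.id
  JOIN person p ON p.id = w.person_id
 WHERE m.id = ?
"""

rows = conn.execute(page_query, (1, 1, 1)).fetchall()
for credit, name, role in rows:
    print(credit, name, role)
```

In a graph model the same page would be a single traversal: start at the movie vertex and follow its edges to the connected people.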
Azure Cosmos DB Gremlin API – best data modeling practices
When using Cosmos DB as your graph store, it is important to always remember that we are not dealing with a true graph database. Underneath there is just a document (JSON) database. Because of that, we will encounter some limitations in terms of analytical capabilities.
If you really want to use Cosmos DB as a graph database, note the following:
Cosmos is built for a huge number of concurrent writes and reads, but it will only perform well if you provision enough throughput. If you’re getting 429 error codes while querying your data, it means that the maximum provisioned throughput was exceeded. At this point you have three options to resolve the issue:
- Provision more throughput – this will work… until your data size and/or request rate increases
- Use a partitioned collection – a partitioned document collection is the most performant option of all, but only if you choose your partition key correctly (otherwise you run into the hot partition problem). It also requires more throughput, as the total RU/s is split among the partition key ranges. If you use a partitioned collection and don’t run queries that traverse the entire graph, consider using bidirectional edges. By default, Cosmos stores outgoing edges with the source vertex. If a vertex has only incoming edges, queries involving that vertex are sent to every partition those incoming edges come from. This creates overhead that is pricey in terms of Request Units. When placing a bidirectional edge, try to include as much data as you can in the edge properties. Bear in mind that the choice of which properties go where (vertex or edge) should be driven by the queries you will run against your graph.
- Remodel your data – you may be surprised how big a performance boost you can gain by redesigning your graph schema. Don’t use big documents: if a vertex has a list property, turn that list into separate vertices. Place some of the data on the edges rather than creating another vertex. This limits your graph traversal overhead, because you no longer need to go through the edge to reach the data; you only need to go to the edge.
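The effect of bidirectional edges and edge properties can be illustrated with a toy simulation. This is not Cosmos DB code and does not call any Cosmos DB API. It is an in-memory model that assumes only the rule described above, namely that an edge is stored in the partition of its source vertex. The class, the partition keys and the sample data are hypothetical.

```python
from collections import defaultdict


class PartitionedGraph:
    """Toy model: every edge lives in the partition of its source vertex."""

    def __init__(self):
        self.partition_of = {}                       # vertex id -> partition key
        self.edges_in_partition = defaultdict(list)  # partition key -> edges

    def add_vertex(self, vertex_id, partition_key):
        self.partition_of[vertex_id] = partition_key

    def add_edge(self, source, label, target, **properties):
        partition = self.partition_of[source]
        self.edges_in_partition[partition].append((source, label, target, properties))

    def partitions_with_edges_into(self, target, label):
        # Incoming edges are scattered over the partitions of their sources,
        # so every one of those partitions has to be consulted.
        return {
            partition
            for partition, edges in self.edges_in_partition.items()
            for source, edge_label, edge_target, _ in edges
            if edge_target == target and edge_label == label
        }

    def out_edges(self, source, label):
        # Outgoing edges are read from a single partition.
        partition = self.partition_of[source]
        return [
            e for e in self.edges_in_partition[partition]
            if e[0] == source and e[1] == label
        ]


toy = PartitionedGraph()
toy.add_vertex("movie-1", partition_key="movies")
cast = [
    ("actor-1", "Actor One", "Hero"),
    ("actor-2", "Actor Two", "Villain"),
    ("actor-3", "Actor Three", "Sidekick"),
]
for actor_id, name, role in cast:
    toy.add_vertex(actor_id, partition_key=actor_id)   # one partition per actor
    toy.add_edge(actor_id, "acted_in", "movie-1", role=role)

# The movie has only incoming edges: finding its cast touches three partitions.
fan_out = toy.partitions_with_edges_into("movie-1", "acted_in")
print(len(fan_out))

# Add the reverse edges and copy the data the movie page needs onto them.
for actor_id, name, role in cast:
    toy.add_edge("movie-1", "has_actor", actor_id, name=name, role=role)

# Now the cast list comes from the movie's own partition, with no extra hop
# to the actor vertices, because name and role sit on the edge.
local_edges = toy.out_edges("movie-1", "has_actor")
print([(e[3]["name"], e[3]["role"]) for e in local_edges])
```

The trade-off is duplicated data: each relationship is written twice and the copied properties have to be kept in sync, which is why the choice should follow from the queries you actually run.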
Cosmos DB is great… if you can afford it
- Turnkey Global Distribution
- Limitless and elastic scalability of writes & reads
- Guaranteed low latency at 99th percentile
- Well-defined consistency choices
- Enterprise-grade performance and security
- Multi-model with native support for NoSQL APIs
If you don’t need all those options, there are other databases you might try, e.g. Neo4j.
If you plan to use Cosmos to query your big data, you might be disappointed. It’s great at heavy writes and reads (of small documents and simple queries), but queries of more than one or two hops might prove too much for it, even if you use all the tricks it has to offer. Turning up the throughput will only make you bleed money faster. Don’t get me wrong: if you use partitioning and a good modeling schema, and bear in mind that traversing millions of nodes is not a good idea, then it’s a pretty good graph store. You can try it for free for 30 days here: https://azure.microsoft.com/en-us/try/cosmosdb/
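To see why each extra hop hurts, here is a small sketch that counts how many vertices a breadth-first traversal has to visit within a given number of hops. The graph is synthetic and the fan-out of 20 edges per vertex is an arbitrary assumption; real graphs are less regular, but the growth pattern is similar. Nothing here is specific to Cosmos DB.

```python
from collections import deque


def build_synthetic_graph(fan_out, depth):
    """Build a tree-shaped graph where every vertex has `fan_out` outgoing edges."""
    adjacency = {0: []}
    frontier = [0]
    next_id = 1
    for _ in range(depth):
        new_frontier = []
        for vertex in frontier:
            for _ in range(fan_out):
                adjacency[vertex].append(next_id)
                adjacency[next_id] = []
                new_frontier.append(next_id)
                next_id += 1
        frontier = new_frontier
    return adjacency


def vertices_within_hops(adjacency, start, max_hops):
    """Breadth-first search that stops expanding after `max_hops` edges."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        vertex, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency[vertex]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return len(seen) - 1  # the start vertex itself is not counted


adjacency = build_synthetic_graph(fan_out=20, depth=4)
for hops in range(1, 5):
    print(hops, vertices_within_hops(adjacency, 0, hops))
```

With a fan-out of 20, one hop reaches 20 vertices, two hops 420, three hops 8,420 and four hops 168,420. Every vertex read costs throughput, so the bill grows roughly by the fan-out factor with each additional hop.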
Graphs and Big Data
If you really want to unlock analytical capabilities not seen before with highly connected data, you should try TigerGraph.
TigerGraph is delivering the next stage in the evolution of the graph database:
- The first system capable of real-time analytics on web-scale data.
- Native Parallel Graph™ (NPG) design focuses on both storage and computation, supporting real-time graph updates and offering built-in parallel computation.
- Offers a SQL-like graph query language (GSQL) that provides ad-hoc exploration and interactive analysis of Big Data.
If we have terabytes of highly connected data, we might want to run our analytics with TigerGraph. Because it’s written in C++ and optimized for parallel query execution, we can expect a performance boost that other graph databases cannot match. You can go ahead and download TigerGraph for free (https://www.tigergraph.com/download), download, for example, the Twitter dataset (42M vertices, 1.5B edges, around 24GB in size) and see how fast the queries perform. You can also generate synthetic graphs using scripts provided by TigerGraph’s team.
“TigerGraph is used by multiple large customers from Uber, Alipay to Wish.com and China Mobile, not only to model and query the data but also to compute graph-based attributes or features in machine learning lingo. In case of China Mobile, TigerGraph generates 118 new features per phone for each of their 600 million phones. This creates over 70 Billion new features to separate the “bad phones” – those with suspected fraud activity from the rest of the “good phones” belonging to regular customers. In case of China Mobile, more training data – 70 Billion features are generated and fed to machine learning solution to improve accuracy of fraud detection.” – Native Parallel Graphs The Next Generation of Graph Database for Real-Time Deep Link Analytics by Yu Xu and Victor Lee, PhD
When it comes to Big Data, to my knowledge no other graph database can handle real-time analytics as well as TigerGraph. You can go to their website and download a PDF with benchmark results against other popular graph databases (https://www.tigergraph.com/benchmark). You can find benchmarks against Cosmos DB in their GitHub repo (https://github.com/tigergraph/ecosys/tree/benchmark/benchmark/cosmosdb).
In conclusion, graphs are great for querying highly connected data. However, most of the currently available graph databases are not suited for Big Data. TigerGraph brings us closer to real-time analytics on truly big graphs. Until an open-source solution appears, TigerGraph will be the default choice for many companies.