Confirmed Sessions at Graph Day SF

We have 18 more sessions to announce. Bookmark this page for updates.

Knowledge Graph in Watson Discovery

Anshu Jain / Nidhi Rajshree - IBM

When we extracted information from 5 million Wikipedia documents, we obtained a knowledge graph of 30 million entities and 200 million relationships. For another Watson Discovery client, we ingested 80 million documents, resulting in 5 billion relationships. Our main goal was to discover knowledge from this graph, and in doing so we faced two key challenges: 1) discovering “un-obvious” knowledge from a knowledge graph of entities and relationships, and 2) doing it at scale. In this talk, we describe how we tackled both challenges: redefining data models, leveraging multiple backend technologies best suited to storing different aspects of the data, creating redundant, cache-like stores to optimize query workloads, and re-thinking our rank and retrieval algorithms so that we don’t compromise on precision and the discovery quotient while designing for scale.
Technical skills and concepts required: Basic knowledge of Natural Language Processing, Information Retrieval and Database Retrieval concepts
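
For readers unfamiliar with what discovering “un-obvious” knowledge over an entity/relationship graph can look like in practice, here is a minimal, hypothetical Gremlin sketch in plain Apache TinkerPop (Java). It is not Watson Discovery’s actual data model or backend; all labels and names are illustrative. It surfaces entities reachable in two hops that are not directly connected to the starting entity.

```java
import java.util.List;

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;

import static org.apache.tinkerpop.gremlin.process.traversal.P.neq;
import static org.apache.tinkerpop.gremlin.process.traversal.P.without;

public class UnobviousConnections {
    public static void main(String[] args) {
        // Small in-memory stand-in for an entity/relationship knowledge graph.
        GraphTraversalSource g = TinkerGraph.open().traversal();

        // Hypothetical entities and relationships (labels are illustrative only).
        Vertex acme = g.addV("Company").property("name", "Acme").next();
        Vertex widget = g.addV("Product").property("name", "Widget").next();
        Vertex globex = g.addV("Company").property("name", "Globex").next();
        acme.addEdge("produces", widget);
        globex.addEdge("competesWith", acme);
        globex.addEdge("acquired", g.addV("Company").property("name", "Initech").next());

        // "Un-obvious" neighbors: entities two hops from Acme that are neither
        // Acme itself nor one of its direct neighbors.
        List<Object> unobvious = g.V().has("name", "Acme").as("src")
                .both().aggregate("direct")
                .both()
                .where(neq("src")).where(without("direct"))
                .dedup().values("name").toList();

        System.out.println(unobvious); // e.g. [Initech]
    }
}
```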

Globally distributed, horizontally scalable graphs with Azure Cosmos DB

Aravind Krishna R - Microsoft

Learn how you can work with massive-scale graphs using Microsoft’s new Azure Cosmos DB service. Cosmos DB lets you interact with graphs using Apache TinkerPop’s Gremlin APIs, along with turn-key global distribution and elastic scaling of storage and throughput.
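
As a rough sketch of what “interacting with graphs using Apache TinkerPop’s Gremlin APIs” looks like from Java, the snippet below submits Gremlin to a Cosmos DB graph endpoint with the standard TinkerPop driver. The endpoint and credential formats, serializer requirements, and any required partition-key property are assumptions to verify against the Azure documentation.

```java
import org.apache.tinkerpop.gremlin.driver.Client;
import org.apache.tinkerpop.gremlin.driver.Cluster;
import org.apache.tinkerpop.gremlin.driver.ResultSet;

public class CosmosGremlinSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint and credentials -- substitute values from your
        // Cosmos DB account (see the Azure docs for the exact formats).
        // Note: the Cosmos Gremlin endpoint may also require a GraphSON 2
        // serializer to be configured on the driver.
        Cluster cluster = Cluster.build("YOUR-ACCOUNT.gremlin.cosmos.azure.com")
                .port(443)
                .enableSsl(true)
                .credentials("/dbs/YOUR-DB/colls/YOUR-GRAPH", "YOUR-PRIMARY-KEY")
                .create();

        Client client = cluster.connect();
        try {
            // Standard Gremlin, submitted as a script over the wire.
            // (A real Cosmos graph would also need its partition-key property set.)
            ResultSet created = client.submit(
                    "g.addV('person').property('id','alice').property('city','SF')");
            created.all().join();

            ResultSet count = client.submit("g.V().hasLabel('person').count()");
            System.out.println(count.one().getLong());
        } finally {
            client.close();
            cluster.close();
        }
    }
}
```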

On-boarding with JanusGraph Performance

Chin Huang - IBM Open Technologies

When adopting a new technology, an upfront evaluation of its performance is essential. Graph databases support a flexible data model that lets users easily represent and manage domain-specific data. At the same time, a number of variables in graph modeling and implementation will influence the performance of loading and querying graph data. Using JanusGraph, one of the latest graph databases available, we evaluated various graph workloads in order to understand its performance characteristics and identify system requirements. In this talk, we will share our performance test approach and the data, schema, tools, and methodology we used. We will also show JanusGraph performance results, provide recommendations for achieving better graph performance, and investigate how to apply the same approach to other graph databases.

Intended audience:
Graph data model designer
Graph data application developer
Graph database operator

Technical skills required:
Basic understanding of graph concepts and performance metrics
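
As a taste of the kind of schema definition and timed load/query workload such an evaluation exercises, here is a minimal JanusGraph sketch. It uses the in-memory backend and made-up labels to stay self-contained; the talk’s actual test harness, datasets, and storage/index backends are not shown here.

```java
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;
import org.janusgraph.core.PropertyKey;
import org.janusgraph.core.schema.JanusGraphManagement;

import static org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__.out;

public class JanusGraphLoadSketch {
    public static void main(String[] args) {
        // In-memory backend keeps the sketch self-contained; a real evaluation
        // would point storage.backend / index.backend at Cassandra, HBase, Elasticsearch, etc.
        JanusGraph graph = JanusGraphFactory.build()
                .set("storage.backend", "inmemory")
                .open();

        // Explicit schema: one of the modeling variables that affects load and query performance.
        JanusGraphManagement mgmt = graph.openManagement();
        PropertyKey name = mgmt.makePropertyKey("name").dataType(String.class).make();
        mgmt.makeVertexLabel("item").make();
        mgmt.makeEdgeLabel("related").make();
        mgmt.buildIndex("itemByName", Vertex.class).addKey(name).buildCompositeIndex();
        mgmt.commit();

        // Timed bulk insert followed by an indexed lookup and a multi-hop traversal.
        GraphTraversalSource g = graph.traversal();
        long start = System.currentTimeMillis();
        Vertex previous = null;
        for (int i = 0; i < 10_000; i++) {
            Vertex v = g.addV("item").property("name", "item-" + i).next();
            if (previous != null) {
                previous.addEdge("related", v);
            }
            previous = v;
        }
        g.tx().commit();
        System.out.println("load ms: " + (System.currentTimeMillis() - start));

        start = System.currentTimeMillis();
        long hops = g.V().has("item", "name", "item-0")
                .repeat(out("related")).times(3).count().next();
        System.out.println("3-hop count: " + hops + " in "
                + (System.currentTimeMillis() - start) + " ms");

        graph.close();
    }
}
```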

Building a Graph Data Pipeline

Paul Sterk / George Tretyakov - Ten-X

Are you thinking about implementing a graph database? Are you wondering how to transform your existing datasets into a graph model? At Ten-X we built a complex, multi-stage graph data pipeline that sources, filters, de-dupes, transforms, loads, and manages different sets of data in JanusGraph. We would like to share some of these insights and hard-earned lessons with you, especially how to deal with poorly documented, complex, and dirty legacy datasets. We will talk about a third-party service you can use to greatly ease de-duplication of geo-oriented records (such as customer addresses), as well as a compelling data enrichment story. We will also cover approaches for converting data records into vertices and edges, strategies for transforming and creating a graph database ‘load-ready’ dataset, and thoughts on our technology stack (Hadoop, Hive, Spark, TinkerPop, JanusGraph, Cassandra, and Elasticsearch).
- Intended audience: engineers, architects
- Technical skills and concepts required: familiarity with a big data stack and graph data modeling
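
To illustrate the record-to-vertices/edges conversion and the “get or create” de-duplication mentioned above, here is a toy TinkerPop sketch. The real pipeline runs on Hadoop/Hive/Spark against JanusGraph and uses a third-party geocoding service rather than the crude string normalization shown; the Listing record, labels, and normalization logic are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;

public class RecordToGraphSketch {
    // Hypothetical flat record from a legacy source.
    record Listing(String listingId, String rawAddress, String sellerName) {}

    public static void main(String[] args) {
        GraphTraversalSource g = TinkerGraph.open().traversal();

        List<Listing> records = Arrays.asList(
                new Listing("L-1", "123 Main St, Austin TX", "Acme Realty"),
                new Listing("L-2", "123 MAIN ST., AUSTIN, TX", "Acme Realty")); // dirty duplicate address

        for (Listing r : records) {
            // Each source record becomes a vertex keyed by its source id.
            Vertex listing = g.addV("listing").property("listingId", r.listingId()).next();

            // Addresses are de-duplicated on a normalized key so both listings
            // attach to the same address vertex ("get or create" upsert).
            String key = normalize(r.rawAddress());
            Vertex address = g.V().has("address", "normalized", key).tryNext()
                    .orElseGet(() -> g.addV("address").property("normalized", key).next());
            listing.addEdge("locatedAt", address);

            Vertex seller = g.V().has("seller", "name", r.sellerName()).tryNext()
                    .orElseGet(() -> g.addV("seller").property("name", r.sellerName()).next());
            listing.addEdge("listedBy", seller);
        }

        // Both listings now share one address vertex.
        System.out.println(g.V().hasLabel("address").count().next()); // 1
    }

    // Crude normalization stand-in for a real geocoding / entity-resolution service.
    static String normalize(String address) {
        return address.toLowerCase().replaceAll("[^a-z0-9]", "");
    }
}
```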

Cross-Device Pairing with Apache Giraph

Chin Huang / Obuli Krishnaraj Venkatesan - Drawbridge

In this talk we will describe how Giraph is used at Drawbridge to match pairs of devices and ultimately assign each pair to an anonymous consumer. Using Giraph, we identify more than 10 billion cross-device pairs among 5 billion identities spread across computers, smartphones, tablets, and connected TVs. We will also discuss how adopting Giraph has reduced runtime while increasing quality and scale.
- Our graph has more than 5 billion vertices and more than 10 billion edges, and generates more than 20 billion pairs.
- When we used Hadoop MapReduce to generate the pairs, it took more than 24 hours to finish.
- Treating this as a graph problem and using Giraph brought the runtime down to less than an hour.
- The Giraph implementation is more flexible and gives us more capability to tune at the individual vertex level. It has also resulted in a double-digit improvement in precision.
- We use 800 workers on 100 machines with 20 TB of combined memory.

We will also cover the hardware changes we made to our Hadoop cluster and how we run Giraph as a Java action in an Oozie workflow, as well as the changes we made to the computation and data representation that improved performance by balancing CPU, memory, and network bandwidth usage to achieve optimal runtime.
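
For readers new to Giraph, the sketch below shows the general shape of a vertex-centric computation of the kind described: a HashMin-style connected-components pass that leaves every device vertex holding a shared cluster id. It is a generic illustration, not Drawbridge’s actual pairing algorithm, and the class name and type choices are assumptions.

```java
import java.io.IOException;

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

/**
 * HashMin-style connected components: every device vertex ends up holding the
 * minimum vertex id in its component, which can serve as an anonymous
 * consumer/cluster identifier.
 */
public class DeviceClusterComputation
        extends BasicComputation<LongWritable, LongWritable, NullWritable, LongWritable> {

    @Override
    public void compute(Vertex<LongWritable, LongWritable, NullWritable> vertex,
                        Iterable<LongWritable> messages) throws IOException {
        long current = vertex.getValue().get();

        if (getSuperstep() == 0) {
            // Start with your own id as the tentative cluster id.
            current = vertex.getId().get();
        }

        // Adopt the smallest id heard so far.
        long min = current;
        for (LongWritable msg : messages) {
            min = Math.min(min, msg.get());
        }

        if (getSuperstep() == 0 || min < current) {
            vertex.setValue(new LongWritable(min));
            // Tell neighbors about the new (smaller) cluster id.
            sendMessageToAllEdges(vertex, new LongWritable(min));
        }
        vertex.voteToHalt();
    }
}
```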

Tinkering the Graph

Karthik Karuppaiya - Ten-X / Chris Pounds - Expero

OK, so you had your Zen moment and suddenly realized that graphs are everywhere and that the best way to model and store your data is as a graph. The next big question is how to actually productize the graph database and let your products access the data. In this talk we will describe how we did exactly that at Ten-X. JanusGraph, Cassandra, Ansible, Spring Boot, Docker, and Mesos are some of the technologies we have used to make our platform production ready. We will share how JanusGraph is deployed in our production environment. We will also talk about the RESTful API layer we built using Spring Boot and TinkerPop for performing CRUD operations and search queries on the JanusGraph database.
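
A minimal sketch of what a Spring Boot + TinkerPop REST layer for graph CRUD can look like, assuming spring-boot-starter-web and a TinkerPop 3.4+ graph on the classpath. The endpoints, labels, and in-memory TinkerGraph stand-in are hypothetical, not Ten-X’s production code, which runs against JanusGraph.

```java
import java.util.List;
import java.util.Map;

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;
import org.springframework.web.bind.annotation.*;

@SpringBootApplication
public class GraphApiApplication {

    public static void main(String[] args) {
        SpringApplication.run(GraphApiApplication.class, args);
    }

    // In a real deployment this would be a traversal source bound to JanusGraph
    // (embedded or remote); TinkerGraph keeps the sketch self-contained.
    @Bean
    public GraphTraversalSource g() {
        return TinkerGraph.open().traversal();
    }
}

@RestController
@RequestMapping("/api/vertices")
class VertexController {

    private final GraphTraversalSource g;

    VertexController(GraphTraversalSource g) {
        this.g = g;
    }

    // Create: POST /api/vertices?label=listing&name=foo
    @PostMapping
    public Map<Object, Object> create(@RequestParam String label, @RequestParam String name) {
        return g.addV(label).property("name", name).elementMap().next();
    }

    // Read: GET /api/vertices/foo
    @GetMapping("/{name}")
    public List<Map<Object, Object>> byName(@PathVariable String name) {
        return g.V().has("name", name).elementMap().toList();
    }
}
```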

Comparing Giraph and GraphX

Jenny Zhao / Yu Gan - Drawbridge

At Drawbridge we process one of the largest graphs in the industry, with more than 11 billion vertices and 70 billion edges. To run complex algorithms such as intelligent clustering efficiently at this scale, we rely on the best distributed graph processing technologies. In this talk we will compare Giraph and GraphX based on performance, tuning, integration with Spark, and how they fit into our platform. We will present an actual use case, demonstrating feature generation for machine learning on our graph using three different technologies: Spark, Giraph, and GraphX. In addition, we will showcase an in-house Python utility module used for collecting and monitoring Giraph performance metrics. Its current implementation not only covers critical data for debugging and profiling algorithms, but also provides a flexible framework that can be extended for various use cases.

JanusGraph: Today and Looking to the Future

Ted Wilmes - Expero

Graph databases are no longer just the new kids on the block, but maturity doesn't mean that they can't be a little edgy. Research in database engines can be applied to graph databases, and open-source projects like JanusGraph are a great place to do it. Join Ted as he looks into the internals of JanusGraph and considers how the engine can be extended and enhanced with modern-day research conjectures and proposals inspired by other database engines and academia.

Successful Techniques for a Large Distributed Database: the Comcast XNET Platform

Jaya Krishna - Tulasea / Ravi Lingam - Comcast

XNET is a platform that captures network change events in a graph database. Currently, XNET handles 10 data sources producing 5 billion change events per day and 200 million consumer queries per day, across multiple regions and data centers. XNET supports 24/7 availability with inter-data-center failover, and is designed to grow to 5 additional data centers and to handle hundreds of data sources and a trillion events per day.

Knowledge Graph Platform: Going Beyond the Database

Michael Grove - Stardog

Adoption of graph databases in the enterprise is gaining momentum, and while they are suitable alternatives to other database types for many use cases, the real power of graphs is not yet being widely utilized. Graphs offer more than just traversals and convenient analytics: they provide a transformative platform that lets an enterprise create knowledge from data. This flexibility, combined with a formal logical model, allows data in all its forms (structured, semi-structured, and unstructured) to blend seamlessly into a single, coherent Knowledge Graph. The formal logical model can not only encode business logic as part of the graph itself; it is also an ideal, declarative way to define graph structure while retaining its flexibility. It can also serve as the basis for alignment between disparate data sources, or as a way to enrich the data before advanced NLP, machine learning, or analytics are performed over all of an enterprise's knowledge.

Project Konigsburg - A GraphAI

Denis Vrdoljak / Danny Wudka - Berkeley Data Science Group

In this presentation, we will talk about our research and development towards creating an AI that can predict connections within graph networks. Unlike typical prediction methods based on counting wedges (e.g., counting “mutual friends”) or on outside knowledge (e.g., syncing with email or contact lists), our approach employs different triadic measurements to engineer features for machine learning models that predict connections based on relationship patterns specific to different applications. We will also go over some of the applications we have in mind for our system, including recommending stores and restaurants based on social connections’ shopping patterns, predicting future social or professional contacts, and even possible applications in counterintelligence and counterterrorism.
We will cover some of the challenges we faced, like biased uncertainty in training data, single-classifier approaches, and limitations of existing graph databases, as well as adapting heuristics to the application: after all, we don’t expect recommending coffee shops to work with the same parameters as identifying sleeper cells!
Finally, we will review the different machine learning models we evaluated, talk about their trade-offs, and conclude with a brief demo of our system in action, along with some of the new developments and possibilities we learned about at Data Day Texas earlier this year.
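
As a generic illustration of turning triadic structure into machine-learning features for link prediction, the sketch below computes a few standard scores (common neighbors, Jaccard, preferential attachment) for a candidate pair. The presenters’ actual feature set and models are not shown here; names and data are made up.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Toy triadic feature extraction for link prediction: for a candidate pair
 * (a, b) with no existing edge, count the wedges a-x-b they would close and
 * derive a few simple features a classifier could consume.
 */
public class TriadicFeatures {

    static double[] features(Map<String, Set<String>> adj, String a, String b) {
        Set<String> na = adj.getOrDefault(a, Set.of());
        Set<String> nb = adj.getOrDefault(b, Set.of());

        Set<String> common = new HashSet<>(na);
        common.retainAll(nb);                      // wedges a-x-b ("mutual friends")

        Set<String> union = new HashSet<>(na);
        union.addAll(nb);

        double commonCount = common.size();
        double jaccard = union.isEmpty() ? 0.0 : commonCount / union.size();
        double preferential = (double) na.size() * nb.size(); // preferential attachment score
        return new double[] {commonCount, jaccard, preferential};
    }

    public static void main(String[] args) {
        Map<String, Set<String>> adj = new HashMap<>();
        adj.put("alice", Set.of("carol", "dave"));
        adj.put("bob", Set.of("carol", "dave", "erin"));
        adj.put("carol", Set.of("alice", "bob"));
        adj.put("dave", Set.of("alice", "bob"));
        adj.put("erin", Set.of("bob"));

        double[] f = features(adj, "alice", "bob");
        System.out.printf("common=%.0f jaccard=%.2f pa=%.0f%n", f[0], f[1], f[2]);
    }
}
```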

Graph-based Taxonomy Generation

Rob McDaniel - Live Stories

How do you automatically generate a taxonomy from a corpus? Getting topics is easy, but organizing them into any meaningful hierarchy is expensive. This talk will cover the real-world application of graph-based taxonomy generation from a weighted topic graph, as proposed by Treeratpituk et al. Included in this lecture will be a brief overview of multi-level graph partitioning, the generation of an edge- and vertex-weighted graph, and a basic open-source implementation with samples.
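
To make the ingredients concrete, here is a toy sketch of building an edge- and vertex-weighted topic graph from document topic assignments and performing one round of heavy-edge matching, the coarsening step at the core of multi-level partitioning schemes. It is illustrative only, not the implementation covered in the talk; the documents and topics are made up.

```java
import java.util.*;

/**
 * Build an edge- and vertex-weighted topic graph from document topic
 * assignments, then run one pass of heavy-edge matching (the coarsening step
 * used by multi-level graph partitioning).
 */
public class TopicGraphSketch {

    public static void main(String[] args) {
        // Hypothetical documents, each tagged with topics (e.g., from a topic-model pass).
        List<Set<String>> docs = List.of(
                Set.of("economy", "tax", "policy"),
                Set.of("economy", "tax"),
                Set.of("policy", "election"),
                Set.of("election", "campaign"));

        // Vertex weight = topic frequency; edge weight = co-occurrence count.
        Map<String, Integer> vertexWeight = new HashMap<>();
        Map<String, Integer> edgeWeight = new HashMap<>();  // key "a|b" with a < b
        for (Set<String> topics : docs) {
            List<String> t = new ArrayList<>(topics);
            Collections.sort(t);
            for (int i = 0; i < t.size(); i++) {
                vertexWeight.merge(t.get(i), 1, Integer::sum);
                for (int j = i + 1; j < t.size(); j++) {
                    edgeWeight.merge(t.get(i) + "|" + t.get(j), 1, Integer::sum);
                }
            }
        }

        // Heavy-edge matching: repeatedly take the heaviest edge whose endpoints
        // are both unmatched and merge them into a coarser "super-topic".
        List<Map.Entry<String, Integer>> edges = new ArrayList<>(edgeWeight.entrySet());
        edges.sort((x, y) -> y.getValue() - x.getValue());
        Set<String> matched = new HashSet<>();
        for (Map.Entry<String, Integer> e : edges) {
            String[] uv = e.getKey().split("\\|");
            if (!matched.contains(uv[0]) && !matched.contains(uv[1])) {
                matched.add(uv[0]);
                matched.add(uv[1]);
                System.out.println("merge " + uv[0] + " + " + uv[1] + " (w=" + e.getValue() + ")");
            }
        }
        System.out.println("vertex weights: " + vertexWeight);
    }
}
```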

Graphs in Genomics

Jason Chin - Pacific Biosciences

Since the discovery of DNA molecules, graph theory and graph methods have been used in analyzing genomes. Recent progress in high-throughput DNA sequencing instruments has pushed the state of the art in using graphs to understand genomics even further. Jason will present recent advances in this field to the data science community.
Jason will begin by going over a few examples where graphs are used to encode genomic information for human health. He will then dive into the graph theory used for a specific problem, genome assembly: essentially, how bioinformaticians currently use graphs in practice to assemble millions of smaller pieces of DNA sequence (hundreds of gigabytes of data) into contiguous genome sequences (several megabases to several gigabases).
Jason will 1) define the problem, 2) give an overview of the general approach, 3) compare the topological and statistical properties of assembly graphs to those of other kinds of graphs, e.g., social networks or small-world graphs, and 4) demonstrate a specific end-to-end example so people can see the whole process. He will wrap up the talk with a view toward future challenges: computational scaling, new related theoretical problems, and standardization of related graph processing.
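
For data scientists new to assembly graphs, here is a deliberately tiny de Bruijn-graph sketch showing the reads-to-graph-to-contig idea. It is one common, simplified formulation; real assemblers, including long-read overlap/string-graph approaches, must also handle sequencing errors, repeats, and coverage, none of which appear in this toy.

```java
import java.util.*;

/**
 * Tiny de Bruijn graph sketch: break reads into k-mers, link each k-mer's
 * (k-1)-mer prefix to its suffix, then walk the resulting path to reconstruct
 * a contiguous sequence (a "contig").
 */
public class DeBruijnSketch {

    public static void main(String[] args) {
        List<String> reads = List.of("ACGTTG", "GTTGCA");   // toy overlapping reads
        int k = 4;

        // Node = (k-1)-mer; edge prefix -> suffix for every k-mer in every read.
        Map<String, List<String>> graph = new LinkedHashMap<>();
        Map<String, Integer> inDegree = new HashMap<>();
        for (String read : reads) {
            for (int i = 0; i + k <= read.length(); i++) {
                String kmer = read.substring(i, i + k);
                String prefix = kmer.substring(0, k - 1);
                String suffix = kmer.substring(1);
                graph.computeIfAbsent(prefix, x -> new ArrayList<>()).add(suffix);
                inDegree.merge(suffix, 1, Integer::sum);
            }
        }

        // Start from a node with no incoming edges and walk the graph greedily
        // (real assemblers must deal with branches, repeats, and cycles).
        String node = graph.keySet().stream()
                .filter(n -> !inDegree.containsKey(n))
                .findFirst().orElseThrow();
        StringBuilder contig = new StringBuilder(node);
        Set<String> visited = new HashSet<>();
        while (graph.containsKey(node) && visited.add(node)) {
            node = graph.get(node).get(0);
            contig.append(node.charAt(node.length() - 1));
        }
        System.out.println(contig);   // ACGTTGCA for these reads
    }
}
```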

Investigating patterns of human trafficking through graph visualization

Christian Miles - Cambridge Intelligence

It is estimated that at any given time, 2.5 million people are in forced labour (including sexual exploitation) as a result of trafficking. The vast majority of victims are between 18 and 24 years of age. In this talk, Christian will walk the audience through the steps taken to collect, visualize and analyze a unique dataset of 22,500 classified advertisements in order to identify potential indicators of human trafficking. This talk and associated work follows a similar methodology for analysis as previous studies completed in Hawaii but applies a distinctly graph-oriented approach to a whole new geography. Methods used will include graph modelling/visualization, web scraping, text mining and geospatial analysis. Christian will demonstrate how his analysis reveals valuable new insights into trafficking routes and highlights patterns of exploitation that could be used to prevent trafficking crimes.
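
As a purely hypothetical illustration of the kind of graph model such an analysis might use, the sketch below links advertisement vertices to shared phone numbers and posting locations, then pulls out ads connected through a common contact number. The dataset, schema, and indicators used in the actual study are not reproduced here; all labels and values are invented.

```java
import java.util.List;

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;

import static org.apache.tinkerpop.gremlin.process.traversal.P.neq;

// Hypothetical graph model: advertisement vertices linked to the phone numbers
// they list and the locations they were posted in, so that ads sharing contact
// details surface as connected clusters for visual review.
public class AdGraphSketch {
    public static void main(String[] args) {
        GraphTraversalSource g = TinkerGraph.open().traversal();

        Vertex phone = g.addV("phone").property("number", "555-0100").next();
        Vertex honolulu = g.addV("location").property("city", "Honolulu").next();
        Vertex sf = g.addV("location").property("city", "San Francisco").next();

        Vertex ad1 = g.addV("ad").property("adId", "A1").next();
        Vertex ad2 = g.addV("ad").property("adId", "A2").next();
        ad1.addEdge("lists", phone);
        ad1.addEdge("postedIn", honolulu);
        ad2.addEdge("lists", phone);
        ad2.addEdge("postedIn", sf);

        // Ads connected to another ad through a shared phone number.
        List<Object> linked = g.V().hasLabel("ad").as("a")
                .out("lists").in("lists")
                .where(neq("a"))
                .values("adId").dedup().toList();
        System.out.println(linked); // [A2, A1]
    }
}
```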

Time for a new relation: Going from RDBMS to Graph

Patrick McFadin - DataStax

Most of our introductory graph sessions come from practitioners with a heavy graph background. Patrick McFadin will present a session from the perspective of someone with a broad relational background (at scale) who has recently started working with graphs.
Like many of you, I have a good deal of experience building data models and applications using a relational database. Along the way you may have learned to data model for non-relational databases, but wait! Now we are seeing graph databases increase in popularity, and here's yet another thing to figure out. I'm here to help! Let's take all that hard-won database knowledge and apply it to building proper graph-based applications. You should take away the following:
- How graph creates relations differently than an RDBMS
- How to insert and query data
- When to use a graph database
- When NOT to use a graph database
- Things that are unique to a graph database
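
To ground the first two takeaways, here is a minimal side-by-side sketch: the relationship that a relational schema expresses with a foreign key and a join becomes a first-class edge in a property graph, and the join becomes a traversal. It uses plain TinkerPop in Java with made-up table/label names, not material from the talk.

```java
import java.util.List;

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;

/**
 * Relational vs. graph, side by side. In an RDBMS the relationship lives in a
 * foreign key and is reassembled at query time:
 *
 *   SELECT o.order_id
 *   FROM customers c
 *   JOIN orders o ON o.customer_id = c.customer_id
 *   WHERE c.name = 'Alice';
 *
 * In a property graph the relationship is an edge, and the query is a
 * traversal that simply walks it.
 */
public class RdbmsToGraphSketch {
    public static void main(String[] args) {
        GraphTraversalSource g = TinkerGraph.open().traversal();

        // "Rows" become vertices; the foreign key becomes an edge.
        Vertex alice = g.addV("customer").property("name", "Alice").next();
        Vertex order = g.addV("order").property("orderId", "O-42").next();
        alice.addEdge("placed", order);

        // Equivalent of the SQL join above: start at Alice, walk 'placed' edges.
        List<Object> orders = g.V().has("customer", "name", "Alice")
                .out("placed").values("orderId").toList();
        System.out.println(orders); // [O-42]
    }
}
```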