3 Questions related to Performance and Scalability
-
What about scalability? It has been a major barrier for adoption
-
RDF triple stores have some scalability issues, but dedicated graph databases, such as Neo4j, scale to very large sizes…
-
The idea of triple stores is very compelling in theory but in practice I have seen systems relying on them struggle to deliver query performance even for simple cases. Have you found this, and if so how have you overcome it?
While early implementations of triple stores experienced performance and scalability issues, I believe these challenges are being overcome with proper architecture, data modeling, tuning of existing offerings, and the entry of new players into the space. Relational systems have had decades of performance tuning, and graph datastores are at last making great strides in this area. Most of my work is in proofs of concept and prototypes, so I lack experience with large graphs. This information may be helpful: https://www.w3.org/wiki/LargeTripleStores
Once we have all the data in a graphdb, I’m still not getting how everyone consumes the data, e.g. science. And what happens to the reports/tables we currently produce – do they go away? #still_learning
Initially, we need to support reports and tables as currently required for data review, publication, and submission. This is why our project transforms the data back to SAS XPT. Consider a future where Tables, Figures, and Listings are dynamic web documents. Every value is a hyperlink that can be traversed by mouse-click or query to metadata, provenance information, or links to other data. Consuming the data is not really that different, with a SPARQL query executed instead of SQL. I regularly execute SPARQL within R to create R dataframes for further manipulation, analysis, and visualization. The consumer need not know (or even care) that the data comes from a graph database rather than their familiar relational store. There is a lack of standardized, user-friendly interfaces at the moment, and most work is very bespoke.
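To illustrate how similar graph consumption feels to tabular querying, here is a minimal pure-Python sketch of pattern-matching over triples to build table-like rows — a stand-in for running a SPARQL SELECT and loading the result into a dataframe. All identifiers, predicates, and values below are hypothetical:

```python
# Hypothetical RDF-style triples (subject, predicate, object).
triples = {
    ("study:001", "hasIndication", "Hypertension"),
    ("study:001", "hasPhase", "Phase III"),
    ("study:002", "hasIndication", "Diabetes"),
    ("study:002", "hasPhase", "Phase II"),
}

def match(s=None, p=None, o=None):
    """Return all triples matching a pattern; None acts as a wildcard,
    like an unbound variable in a SPARQL basic graph pattern."""
    return [(ts, tp, to) for (ts, tp, to) in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

# Build a table of study -> indication, the way a SELECT query would:
rows = [{"study": s, "indication": o} for (s, _, o) in match(p="hasIndication")]
```

The consumer of `rows` works with ordinary tabular records and never sees the graph underneath, which is the point made above.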
Is there a reason to be limited to clinical studies? Why not include invivo and invitro research studies in the research or preclinical space?
There is vast potential (and much existing work) outside of the clinical study realm. Explore the Life Sciences domain within the Linked Open Data Cloud:
https://lod-cloud.net/
The most effective ways include the development and publication of successful use cases.
Publishing and presenting at conferences like PhUSE (www.phuse.eu) and SWAT4HCLS (http://www.swat4ls.org/) will stimulate development. Include industry conferences not known for Knowledge Graph content, as well as local meet-up events. This wider exposure will help build the momentum needed to move this technology into the mainstream for pharma.
In which areas of pharma do you see the initial successful use cases (clinical, safety, research, etc.)?
PubChem is a prime example in the research area, with over 138 billion triples as of 2019-01-21: https://pubchemdocs.ncbi.nlm.nih.gov/rdf-statistics Omics, drug discovery, and non-clinical are strong bets for early success.
As a beginner, what is the best way to get my feet wet and do some hands-on work? Any recommendations on books or tools to use?
Dive in! Find a problem you would like to solve, and try doing that with Linked Data.
What are examples of a low hanging fruit to demonstrate success so management will invest the resources to implement this.
Step one is to align with management initiatives to secure support. Identify their priorities, then isolate aspects of the problem that can be solved in a Proof of Concept. Examples may be as diverse as easily identifying who worked on a past project, who has a specific expertise, where the data for a study is located, and when and where that study was published. You must solve a simple problem that makes an impact and is not merely an academic demonstration. In one example, I linked:
- Internal study information: indication, phase, study population, etc.
- Study information from ClinicalTrials.gov
- Study data locations and time stamps on a file server
- Contact information of content creators
I built a small ontology to support the entities. It was not one hundred percent correct or aligned with other ontologies, but it allowed me to develop the graph model and answer questions for management. Don’t worry about being absolutely correct in your first proof of concept. You will learn along the way and that is okay.
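The kind of lightweight linking described above can be sketched in a few lines of Python, treating the graph as a list of triples that bridges the four sources. All identifiers, paths, and contact details below are hypothetical stand-ins, not real project data:

```python
# Hypothetical mini-graph linking internal study info, ClinicalTrials.gov,
# file-server locations, and content creators.
graph = [
    ("study:ABC-123", "hasIndication", "Asthma"),
    ("study:ABC-123", "registeredAs", "NCT00000000"),
    ("study:ABC-123", "dataLocatedAt", "//fileserver/studies/ABC-123/"),
    ("study:ABC-123", "contentCreator", "person:jdoe"),
    ("person:jdoe", "hasEmail", "jdoe@example.com"),
]

def objects(subject, predicate):
    """All objects of triples with the given subject and predicate."""
    return [o for (s, p, o) in graph if s == subject and p == predicate]

# Answer a management question: where is the data, and who do I contact?
location = objects("study:ABC-123", "dataLocatedAt")[0]
creator = objects("study:ABC-123", "contentCreator")[0]
email = objects(creator, "hasEmail")[0]
```

The two-hop lookup (study to creator to email) is exactly the kind of traversal that a graph model makes natural and a silo-per-source setup makes painful.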
Would implementation by some of the large CROs help promote this across their multiple clients?
The more implementations, the better! From my previous experience at a CRO, I know there will be challenges because you must satisfy the many and diverse client requirements. At the same time, the CRO environment presents unique opportunities, because a graph database is capable of bridging diverse data sources, and is very flexible and adaptable.
How do we start with ontology development. Any tools?
I use Protege and recommend the Protege short course at Stanford, CA (March 27-29): https://protege.stanford.edu/short-courses.php
We also use TopBraid Composer in our project work: https://www.topquadrant.com/tools/modeling-topbraid-composer-standard-edition/
This paper by Noy and McGuiness is a classic: https://protege.stanford.edu/publications/ontology_development/ontology101.pdf
The classic book “Semantic Web for the Working Ontologist”: http://workingontologist.org/
Many ontologies are available online. Find one in your domain of interest and upload it into an ontology editor to see how the pieces fit together.
What is your view on graph databases such as Neo4J?
I really like Neo4j. In our first hands-on workshops we split our time between Neo4j and RDF. First-time users find it easy to interactively create nodes and relations, and then query them within the same interface. It is a great way to experience linked data, and the application has come a long way in the last couple of years. I’ve seen some really nice projects in Neo4j.
That said, my opinion is that clinical trials results data and research data are best suited to a native RDF store (Neo4j fans will point out that it now supports RDF). This is a debate best left for another day. There is no reason not to experience both, along with other solutions like GRAKN.AI.
Does linked data potentially increase the risk of re-identification of anonymised data or unblinding of blinded clinical data?
This is an important concern for technologies that facilitate the joining of data from disparate sources. We must work closely with those who are designing guidance around de-identification and anonymization of data. On a positive note, I recommend looking into the #Solid project headed by Sir Tim Berners-Lee for insight into how Linked Data may be used as a way for people to take back control of their data online. https://solid.mit.edu/
What software/tools (graph database, RDF triple store) do you recommend to handle huge data such as 100+ billion triples?
I am not in a position to endorse one tool over another, so I suggest you look at applications that support the large online triple stores like DBpedia and the information available here: https://www.w3.org/wiki/LargeTripleStores
The problem with the technology that I have encountered is not complexity, but unreliability. In particular, Virtuoso cannot handle complex queries, and it simply fails to respond if it runs out of memory. There is a computer science challenge here, to efficiently shard a triple store in the light of the actual query stream. A good topic for a public/private partnership.
My own experience with lack of reliability has been with online endpoints that are poorly maintained or have ceased to exist. Some of these issues can be solved with improvements in systems architecture, data modeling, etc. More vendors are entering this space, with many hyping increased performance. I think the idea of public/private partnerships is a good one to help solve these challenges.
Do you know about identifiers.org/the miriam repository? maybe this could be leveraged to help handle the study ID standards?
Thank you! We are looking into identifiers.org for ideas on how to approach the Study URI concept. Please follow along with us and add your input at: https://github.com/phuse-org/LinkedDataEducation/blob/master/doc/StudyURI.md
Do you have any examples on how you would address the versioning issue in RDF?
Versioning remains a challenge. Simple versioning can be accomplished with the PROV ontology (https://www.w3.org/TR/prov-primer/). On a larger scale, you could use named graphs as snapshots, but the storage requirements would become immense. You can find some interesting discussion and references here: https://project-hobbit.eu/versioning-for-big-linked-data-approaches-and-benchmarks/ https://pdfs.semanticscholar.org/ed73/3deb78dfffd53279ec3eb768477646204236.pdf
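A rough sketch of the named-graph snapshot approach mentioned above, using plain Python sets as stand-ins for named graphs (graph names and triples are hypothetical). Each snapshot holds a full copy of the data at a point in time, which is also why storage grows quickly:

```python
# Two hypothetical named graphs, each a full snapshot of the dataset.
snapshots = {
    "graph:v1": {("study:001", "hasStatus", "Recruiting")},
    "graph:v2": {("study:001", "hasStatus", "Completed"),
                 ("study:001", "hasResultDate", "2019-01-15")},
}

def diff(old, new):
    """Triples added and removed between two snapshots."""
    added = snapshots[new] - snapshots[old]
    removed = snapshots[old] - snapshots[new]
    return added, removed

added, removed = diff("graph:v1", "graph:v2")
```

Storing only the `diff` output (a changeset) rather than full snapshots is the direction several of the benchmarking efforts linked above explore.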
Recommendations: 1. start small – grow models, 2. Easier to sell FAIR than Linked Data
Agreed! It is much easier to understand the concepts and benefits of FAIR than “Let’s link all the things!” The terms Semantic Web, Linked Data, RDF, Graph Data… have not aided acceptance. By adopting the easily-understood FAIR concepts, Linked Data (RDF) comes along for the ride.
Question: How have you sold graph data bases vs relational? Is it necessary? Seems as though people are more comfortable using data models/knowledge graphs but maintaining relational backends.
This is another fundamental challenge. People are familiar and comfortable with relational databases. It has become how we think about data. Relational databases are not going anywhere soon. No company will rip out existing relational systems and replace them with graph databases. Rather, graph databases will first exist as a glue between data silos, using virtual graphs to bring together very disparate data sources. What we need to sell is “relational + graph data” and leverage the strengths of each.
On a related topic, you may wish to read my recent PhUSE paper “Overcoming Resistance to Technology Change: A Linked Data Perspective.” http://www.phusewiki.org/docs/Frankfut%20Connect%202018/TT/Papers/TT01-tt04-19214.pdf
What do you mean by “looking at graph?” Do you mean looking at using graph databases?
I mean we should foster a graph mindset for how we look at data and the processes that surround it, and how we manage this data within graph databases. My enthusiasm for this topic often gets in the way of clarity.
Could Tim comment on why he uses {Vendor X} please?
Choosing the right graph vendor from the many now in the marketplace is an important consideration. Do they understand your data sources, the questions you need to answer, your regulatory requirements? Does their company philosophy align with yours, and are they willing to partner with you on a project? Be wary of companies and consultants who over-promise. There have been several false starts already in our industry. I recommend starting with a vendor-agnostic consultant who has a proven understanding of both the pharmaceutical industry and graph technology spaces. The consultant should take the time to understand your data, processes, budget, timeline, etc., and pair you with the right vendor and technology.
Ontologies for integration only work if we use shared ontologies – how do we get there?
We must continue to break down barriers to cooperation in the pre-competitive space by building on the successes of PhUSE, Pistoia Alliance, and others. Management must be convinced of the benefits of this cooperation (we all rise together). Ontologies should be designed in modular fashion with general concepts that can be extended with proprietary information within companies, without any expectation of those modifications returning to the general community. It is definitely a challenge.
Inferencing is based on known rules, how does this help us to detect “new” information
These rules can help us infer relationships that were not explicitly stated in our source data. Take the example triple that states a Person participatesIn a Clinical Trial. It is easy to define a rule that states “wherever you see a participatesIn predicate, the subject of that relationship is a HumanStudySubject”. The data can then be queried for HumanStudySubjects, even though that information was not stated in the original source data. The unintended consequences of inferencing must be considered. What if you link to another dataset that describes how a Person participatesIn an event, like a protest rally!
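The rule described above can be sketched as a simple forward-chaining step over a set of triples (all identifiers are hypothetical):

```python
# Hypothetical source triples.
data = {
    ("person:p1", "participatesIn", "trial:CT-42"),
    ("person:p2", "hasRole", "Investigator"),
}

def infer(triples):
    """Apply the rule: any subject of a participatesIn triple is
    inferred to be a HumanStudySubject."""
    inferred = set(triples)
    for (s, p, o) in triples:
        if p == "participatesIn":
            inferred.add((s, "rdf:type", "HumanStudySubject"))
    return inferred

result = infer(data)
```

Note that the inferred type triple appears for `person:p1` even though it was never stated — which is exactly the power, and the risk, discussed above: link in a dataset where people participatesIn protest rallies and the same rule fires there too.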
SPIN rules would be a better example for rules-based data derivation, where a person’s age could be derived based on the date the rule is executed and the person’s birthdate. Rules like these can be study-dependent (based on the study design and protocol), where age could be calculated based on screening date, date of adverse event, or other timepoint.
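A minimal sketch of such a derivation rule, computing age against a study-dependent reference date (the dates below are hypothetical; a SPIN rule would express the same arithmetic in SPARQL):

```python
from datetime import date

def derive_age(birth_date, reference_date):
    """Whole years between birth_date and reference_date."""
    years = reference_date.year - birth_date.year
    # Subtract one year if the birthday has not yet occurred this year.
    if (reference_date.month, reference_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years

# Age at a hypothetical screening date rather than at rule execution time:
age_at_screening = derive_age(date(1960, 6, 15), date(2019, 1, 30))
```

Swapping the reference date for the adverse-event date or another protocol timepoint changes the rule without touching the stored birthdate, which is the study-dependent flexibility described above.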
Can u talk about SEND and RDF alignment. Any ontologies mapping initiatives details?
Most of my work is within the late-phase clinical trials realm. I invite you to look for the upcoming publication from the Pistoia Alliance: “Beyond SEND: Leveraging Nonclinical Data to Drive Translational Research Forward.”
Our current “Clinical Trials Data as RDF” project will transform into a new project in the coming weeks that will include SEND. A sub team will investigate and leverage existing SEND ontologies where possible. I encourage anyone with an interest in this area to contact me and join our project as a contributor or to listen in.
Did the CDISC Blue Ribbon Commission have any feedback on Graph as a future direction?
New efforts within CDISC are investigating graph approaches to move the organization into the future. Look for more news from them in the very near future.
ROI examples which pull together preclinical information through to post marketing?
This is the elusive “ROI Unicorn”, a Moonshot for our industry. My advice is to concentrate on Roofshots that make incremental, production impacts, leveraging cooperation and the flexible nature of graph data to build toward the Moonshot of end-to-end connectivity.
The i2b2-tranSMART Foundation has a very active effort in medical ontologies, and an ontology working group… just FYI.
Thank you! I will look into it.
Have you taken a look at the approach at the Allotrope Foundation?
(https://www.allotrope.org/about-us, https://www.allotrope.org/solution)
I only recently became aware of Allotrope. Thank you for these links.