Who should use bigdata?

It can be confusing to navigate the various triple store options out there. Which one is best for your application?

Let’s take a step back and look at the history of bigdata. Bigdata was not developed in a vacuum. Bryan and I were building a system for an intelligence community customer that used a triple store as the core of the data layer. This system allowed users to federate and semantically align different structured and unstructured data sets into a single fused view for better analysis. On top of the triple store knowledge base sat user-facing tools that let analysts import structured data, send unstructured documents through a harvest/extract pipeline, search for documents and entities, view graphical link charts, annotate documents, and a host of other things. The system also had an open RESTful service API, which allowed other tools to read from, write to, and query the knowledge base.

The system was multi-user, so it had to handle real-time updates and deletes alongside concurrent queries. The knowledge base had to be fast enough to keep up with system load and scalable enough to handle lots of data. RDF was a great technology choice for the problem, but we found the RDF database implementations a bit lacking or a bit expensive, or both. And no vendor was tackling scale-out at that time. This was also around the time of Google’s publication on BigTable, and we thought: can we apply these fundamental concepts to RDF data?

Bigdata is not just for applications with multi-billion triple requirements. The single-server version of bigdata is an excellent choice for any system that needs a triple store: it’s robust, fast, and handles concurrency very well. Bigdata handles real-time updates and deletes with incremental inference and incremental truth maintenance. Concurrent writes are serialized, but in the system for which we designed bigdata, these updates and deletes of roughly 10 to 1,000 triples each were absorbed almost instantly. Meanwhile, bigdata’s MVCC (multi-version concurrency control) model allows readers to operate totally independently of writers and other readers, so reads and queries never have to wait before executing. And when they do execute, they go through bigdata’s high-performance join engine for lightning-fast query response times.
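To make the concurrency model concrete, here is a minimal Python sketch of the general MVCC idea: writers are serialized behind a lock, each commit publishes a new immutable snapshot, and readers keep working against whichever snapshot they started with. This is a toy illustration, not bigdata's implementation or API; `ToyTripleStore`, `snapshot`, `update`, and `match` are hypothetical names invented for this example.

```python
import threading


class ToyTripleStore:
    """Toy MVCC store (hypothetical): every commit publishes a new immutable snapshot."""

    def __init__(self):
        self._write_lock = threading.Lock()  # serializes writers only
        self._committed = frozenset()        # latest committed set of (s, p, o) triples

    def snapshot(self):
        # Readers take a reference to the current immutable snapshot.
        # They never acquire the write lock, so they never wait on writers.
        return self._committed

    def update(self, add=(), remove=()):
        # Writers run one at a time and build a new version off to the side.
        with self._write_lock:
            new_version = (self._committed - frozenset(remove)) | frozenset(add)
            # Publishing the commit is a single reference swap.
            self._committed = new_version


def match(view, s=None, p=None, o=None):
    """Return the triples in a snapshot that match a pattern; None is a wildcard."""
    return sorted(
        t for t in view
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


store = ToyTripleStore()
store.update(add=[("alice", "knows", "bob")])

# A reader starts a query against the current commit point.
reader_view = store.snapshot()

# A writer commits an update and a delete while that reader is still active.
store.update(add=[("bob", "knows", "carol")], remove=[("alice", "knows", "bob")])

# The in-flight reader still sees the version it started with ...
assert match(reader_view, p="knows") == [("alice", "knows", "bob")]
# ... while a new reader sees the latest commit.
assert match(store.snapshot(), p="knows") == [("bob", "knows", "carol")]
```

Copying the whole set on every commit keeps the sketch short but would be far too expensive in practice; a real store would share unchanged structure between versions instead.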

If you are dissatisfied with the performance, robustness, or feature-completeness of your current triple store (as we were), then look no further. Bigdata was born of the same dissatisfaction, and it was designed and implemented specifically for real-world systems like yours.