Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J. Fernandez-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, ́ Frances Perry, Eric Schmidt, Sam Whittle; The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing; In Proceedings of the Conference on Very Large Data Bases (VLDB), Volume 8, Number 12; 2015-08-31; 12 pages; Google, paywall
Unbounded, unordered, global-scale datasets are increasingly common in day-to-day business (e.g. Web logs, mobile usage statistics, and sensor networks). At the same time, consumers of these datasets have evolved sophisticated requirements, such as event-time ordering and windowing by features of the data themselves, in addition to an insatiable hunger for faster answers. Meanwhile, practicality dictates that one can never fully optimize along all dimensions of correctness, latency, and cost for these types of input. As a result, data processing practitioners are left with the quandary of how to reconcile the tensions between these seemingly competing propositions, often resulting in disparate implementations and systems.
We propose that a fundamental shift of approach is necessary to deal with these evolved requirements in modern data processing. We as a field must stop trying to groom unbounded datasets into finite pools of information that even- tually become complete, and instead live and breathe under the assumption that we will never know if or when we have seen all of our data, only that new data will arrive, old data may be retracted, and the only way to make this problem tractable is via principled abstractions that allow the practitioner the choice of appropriate tradeoffs along the axes of interest: correctness, latency, and cost.
In this paper, we present one such approach, the Dataflow Mode , along with a detailed examination of the semantics it enables, an overview of the core principles that guided its design, and a validation of the model itself via the real-world experiences that led to its development
- Daniel J. Abadi , Don Carney , Ugur Çetintemel , Mitch Cherniack , Christian Convey , Sangdon Lee , Michael Stonebraker , Nesime Tatbul , Stan Zdonik, Aurora: a new model and architecture for data stream management, The VLDB Journal — The International Journal on Very Large Data Bases, v.12 n.2, p.120-139, August 2003 [doi>10.1007/s00778-003-0095-z]
- Tyler Akidau , Alex Balikov , Kaya Bekiroğlu , Slava Chernyak , Josh Haberman , Reuven Lax , Sam McVeety , Daniel Mills , Paul Nordstrom , Sam Whittle, MillWheel: fault-tolerant stream processing at internet scale, Proceedings of the VLDB Endowment, v.6 n.11, p.1033-1044, August 2013 [doi>10.14778/2536222.2536229]
- Alexander Alexandrov , Rico Bergmann , Stephan Ewen , Johann-Christoph Freytag , Fabian Hueske , Arvid Heise , Odej Kao , Marcus Leich , Ulf Leser , Volker Markl , Felix Naumann , Mathias Peters , Astrid Rheinländer , Matthias J. Sax , Sebastian Schelter , Mareike Höger , Kostas Tzoumas , Daniel Warneke, The Stratosphere platform for big data analytics, The VLDB Journal — The International Journal on Very Large Data Bases, v.23 n.6, p.939-964, December 2014 [doi>10.1007/s00778-014-0357-y]
- Apache. Apache Hadoop. http://hadoop.apache.org, 2012.
- Apache. Apache Storm. http://storm.apache.org, 2013.
- Apache. Apache Flink. http://flink.apache.org/, 2014.
- Apache. Apache Samza. http://samza.apache.org, 2014.
- R. S. Barga et al. Consistent Streaming Through Time: A Vision for Event Stream Processing. In Proc. of the Third Biennial Conf. on Innovative Data Systems Research (CIDR), pages 363–374, 2007.
- Irina Botan , Roozbeh Derakhshan , Nihal Dindar , Laura Haas , Renée J. Miller , Nesime Tatbul, SECRET: a model for analysis of the execution semantics of stream processing systems, Proceedings of the VLDB Endowment, v.3 n.1-2, September 2010 [doi>10.14778/1920841.1920874]
- Oscar Boykin , Sam Ritchie , Ian O’Connell , Jimmy Lin, Summingbird: a framework for integrating batch and online MapReduce computations, Proceedings of the VLDB Endowment, v.7 n.13, p.1441-1451, August 2014 [doi>10.14778/2733004.2733016]
- Cask. Tigon. http://tigon.io/, 2015.
- Craig Chambers , Ashish Raniwala , Frances Perry , Stephen Adams , Robert R. Henry , Robert Bradshaw , Nathan Weizenbaum, FlumeJava: easy, efficient data-parallel pipelines, Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation, June 05-10, 2010, Toronto, Ontario, Canada [doi>10.1145/1806596.1806638]
- B. Chandramouli et al. Trill: A High-Performance Incremental Query Processor for Diverse Analytics. In Proc. of the 41st Int. Conf. on Very Large Data Bases (VLDB), 2015.
- Sirish Chandrasekaran , Owen Cooper , Amol Deshpande , Michael J. Franklin , Joseph M. Hellerstein , Wei Hong , Sailesh Krishnamurthy , Samuel R. Madden , Fred Reiss , Mehul A. Shah, TelegraphCQ: continuous dataflow processing, Proceedings of the 2003 ACM SIGMOD international conference on Management of data, June 09-12, 2003, San Diego, California [doi>10.1145/872757.872857]
- Jianjun Chen , David J. DeWitt , Feng Tian , Yuan Wang, NiagaraCQ: a scalable continuous query system for Internet databases, Proceedings of the 2000 ACM SIGMOD international conference on Management of data, p.379-390, May 15-18, 2000, Dallas, Texas, USA [doi>10.1145/342009.335432]
- Jeffrey Dean , Sanjay Ghemawat, MapReduce: simplified data processing on large clusters, Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation, p.10-10, December 06-08, 2004, San Francisco, CA
- EsperTech. Esper. http://www.espertech.com/esper/, 2006.
- Alan F. Gates , Olga Natkovich , Shubham Chopra , Pradeep Kamath , Shravan M. Narayanamurthy , Christopher Olston , Benjamin Reed , Santhosh Srinivasan , Utkarsh Srivastava, Building a high-level dataflow system on top of Map-Reduce: the Pig experience, Proceedings of the VLDB Endowment, v.2 n.2, August 2009 [doi>10.14778/1687553.1687568]
- Google. Dataflow SDK. https://github.com/GoogleCloudPlatform/DataflowJavaSDK, 2015.
- Google. Google Cloud Dataflow. https://cloud.google.com/dataflow/, 2015.
- Theodore Johnson , S. Muthukrishnan , Vladislav Shkapenyuk , Oliver Spatscheck, A heartbeat mechanism and its application in gigascope, Proceedings of the 31st international conference on Very large data bases, August 30-September 02, 2005, Trondheim, Norway
- Jin Li , David Maier , Kristin Tufte , Vassilis Papadimos , Peter A. Tucker, Semantics and evaluation techniques for window aggregates in data streams, Proceedings of the 2005 ACM SIGMOD international conference on Management of data, June 14-16, 2005, Baltimore, Maryland [doi>10.1145/1066157.1066193]
- Jin Li , Kristin Tufte , Vladislav Shkapenyuk , Vassilis Papadimos , Theodore Johnson , David Maier, Out-of-order processing: a new architecture for high-performance stream systems, Proceedings of the VLDB Endowment, v.1 n.1, August 2008 [doi>10.14778/1453856.1453890]
- David Maier , Jin Li , Peter Tucker , Kristin Tufte , Vassilis Papadimos, Semantics of data streams and operators, Proceedings of the 10th international conference on Database Theory, January 05-07, 2005, Edinburgh, UK [doi>10.1007/978-3-540-30570-5_3]
- N. Marz. How to beat the CAP theorem. http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html, 2011.
- S. Murthy et al. Pulsar — Real-Time Analytics at Scale. Technical report, eBay, 2015.
- SQLStream. http://sqlstream.com/, 2015.
- Utkarsh Srivastava , Jennifer Widom, Flexible time management in data stream systems, Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 14-16, 2004, Paris, France [doi>10.1145/1055558.1055596]
- Ashish Thusoo , Joydeep Sen Sarma , Namit Jain , Zheng Shao , Prasad Chakka , Suresh Anthony , Hao Liu , Pete Wyckoff , Raghotham Murthy, Hive: a warehousing solution over a map-reduce framework, Proceedings of the VLDB Endowment, v.2 n.2, August 2009 [doi>10.14778/1687553.1687609]
- Peter A. Tucker , David Maier , Tim Sheard , Leonidas Fegaras, Exploiting Punctuation Semantics in Continuous Data Streams, IEEE Transactions on Knowledge and Data Engineering, v.15 n.3, p.555-568, March 2003 [doi>10.1109/TKDE.2003.1198390]
- James Whiteneck , Kristin Tufte , Amit Bhat , David Maier , Rafael J. Fernández-Moctezuma, Framing the question: detecting and filling spatial-temporal windows, Proceedings of the ACM SIGSPATIAL International Workshop on GeoStreaming, p.19-22, November 02-02, 2010, San Jose, California [doi>10.1145/1878500.1878506]
- F. Yang and others. Sonora: A Platform for Continuous Mobile-Cloud Computing. Technical Report MSR-TR-2012-34, Microsoft Research Asia.
- Matei Zaharia , Mosharaf Chowdhury , Tathagata Das , Ankur Dave , Justin Ma , Murphy McCauley , Michael J. Franklin , Scott Shenker , Ion Stoica, Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing, Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, April 25-27, 2012, San Jose, CA
- Matei Zaharia , Tathagata Das , Haoyuan Li , Timothy Hunter , Scott Shenker , Ion Stoica, Discretized streams: fault-tolerant streaming computation at scale, Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, November 03-06, 2013, Farminton, Pennsylvania [doi>10.1145/2517349.2522737]