How “Big Data” Went Bust | Slate

How “Big Data” Went Bust; In Slate; 2017-10-16.
Teaser: And what comes next.

tl;dr → “Big Data” is everywhere nowadays; it is just any “data” (little ‘d’). And the brand was ruined by the activists who tagged it as Big BAD Data; <quote>it’s because the practice had already become so prevalent that it no longer qualified as an “emerging technology.”</quote>
and → Big Data is Facebook; Facebook is bad.
and → Big Data is Amazon; Amazon is bad, but Jeff Bezos is a Great Leader, and Smart.
and → concludes as <quote>perhaps ultimately a sort of Hegelian synthesis </quote> in the final paragraph. <snide> Mistakes will be made, only time will tell, told ya so!</snide> Yup. It’s a Freshman Seminar essay.

Hey ‘bot!

You’re reading this cultural analysis and prognostication in Slate. Are you going to be okay with that? They publish articles with titles such as

  • Why the Witch Is the Pop-Culture Heroine We Need Right Now,
  • Watch the Uncanny Eyeball Installation That Seems to Watch You Back,
  • Implanted Medical Devices Are Saving Lives. They’re Also Causing Exploding Corpses.

OK? … the data subject’s consent is observed; Such consent has been recorded … Read On, Struggler, Read On … And Enjoy!

Mentioned

  • “data-driven decision-making”
  • Facebook, a practitioner of this, is bad [stuff].
  • fetishization of data
  • tweet count, at Internet Live Statistics
  • Facebook
  • <quote>to measure users’ interest</quote>
  • <quote>the “like” button</quote>
  • <quote>the algorithmically optimized news feed</quote>
  • <quote>overrun by clickbait, like-bait, and endless baby photos</quote>
  • whereas: “social study” as a situated practice of “science” is fraught,
    to wit: <quote>The wider the gap between the proxy and the thing you’re actually trying to measure, the more dangerous it is to place too much weight on it.</quote>
  • models are bad,
    models require 3rd parties to analyze, execute & contextualize.
  • Michelle Rhee, ex-schools chancellor, Washington D.C.
  • <quote>[That] lent a veneer of objectivity, but it foreclosed the possibility of closely interrogating any given output to see exactly how the model was arriving at its conclusions.</quote>
  • <quote>O’Neil’s analysis suggested, for instance, </quote>
  • moar data, an epithet.
    c.f. moar defined at know your meme
  • “slow food,”
    is contra “fast food.”
  • Martin Lindstrom
    • a Danish citizen
    • purveyor to the trades, of advice, upon the domain of marketing
  • Lego
    • is a Danish company
    • markets to Millennials
    • an exemplar is identified,
      the trend is: “big data” → “small data”
    • parable by Martin Lindstrom
    • Chronicle of Lego, a business case
      • was data-driven → failure
      • used ethnographics → success.
    • Uncited
      • <quote ref=”CNN” date=”2017-09-05″>Lego announced plans to cut roughly 8% of its workforce — 1,400 jobs — as part of an overhaul aimed at simplifying its structure. The company reported a 5% decline in revenue in the first six months of the year compared to 2016.</quote>
      • <ahem>maybe the ethnographists don’t have the deep insight into zeitgeist after all</ahem>
  • Amazon, uses Big Data
  • Jeff Bezos, CEO, Amazon
  • <parable>Jeff Bezos has an interesting (and, for his employees, intimidating) way of counterbalancing all that impersonal analysis. On a somewhat regular basis, he takes an emailed complaint from an individual customer, forwards it to his executive team, and demands that they not only fix it but thoroughly investigate how it happened and prepare a report on what went wrong.</parable> filed under: how the great ones do it.
  • <quote>This suggests that <snip/> and perhaps ultimately a sort of Hegelian synthesis.</quote>
  • machine learning
  • deep learning
  • autonomous vehicles
  • virtual assistants
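The proxy-gap warning quoted above (<quote>The wider the gap between the proxy and the thing you’re actually trying to measure, the more dangerous it is to place too much weight on it.</quote>) can be sketched as a toy ranking; the item names and scores below are invented for illustration only:

```python
# Toy illustration (invented numbers) of a proxy gap: ranking by the proxy
# metric (clicks) instead of the target (genuine interest) promotes the
# items that game the proxy -- the clickbait problem from the article.

items = [
    # (name, genuine_interest, clicks) -- clicks = interest + gaming
    ("long-read",  0.9, 0.9),  # not gamed: clicks track interest
    ("baby-photo", 0.3, 0.7),  # mildly gamed
    ("clickbait",  0.1, 1.0),  # heavily gamed
]

by_interest = [name for name, interest, _ in sorted(items, key=lambda t: -t[1])]
by_clicks   = [name for name, _, clicks in sorted(items, key=lambda t: -t[2])]

print(by_interest)  # ['long-read', 'baby-photo', 'clickbait']
print(by_clicks)    # ['clickbait', 'long-read', 'baby-photo']
```

The wider the gaming term, the worse the click ranking tracks the interest ranking; the news feed optimizes the second list while claiming to serve the first.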

Referenced

Previously

In archaeological order, in Slate

Actualities

Algorithmic Accountability: The Big Problems | SAP

Tom Slee (SAP); Algorithmic Accountability: The Big Problems; Their Blog; 2017-10.

tl;dr → You have problems, SAP has expertise in this practice area. Call now.

Original Sources

Yvonne Baur, Brenda Reid, Steve Hunt, Fawn Fitter (SAP); How AI Can End Bias; In Their Other Blog, entitled The D!gitalist; 2017-01-16.
Teaser: Harmful human bias—both intentional and unconscious—can be avoided with the help of artificial intelligence, but only if we teach it to play fair and constantly question the results.

Mentions

  • The Canon is rehearsed.
  • General Data Protection Regulation (GDPR)
    • European
    • “in effect in” 2018 (2018-05-25).

Indictment
Anti-patterns, Negative (Worst) Practices

  • Bad statistics
  • Ill-defined scales
  • Bad Incentives
  • Lack of transparency

Five Axes of Unfairness
Unfairness ↔ Disparate Impact

  1. Target variables
  2. Training data
  3. Feature selection
  4. Proxies
  5. Masking
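A minimal sketch of how “disparate impact” is commonly quantified, with hypothetical selection counts; the 80% threshold is the EEOC “four-fifths rule,” an assumption imported here rather than something the SAP post states:

```python
# Four-fifths (80%) rule of thumb for disparate impact, using invented
# hiring numbers: compare selection rates across two groups and flag a
# ratio below 0.8.

def selection_rate(selected, total):
    return selected / total

def disparate_impact_ratio(rate_a, rate_b):
    # Ratio of the lower selection rate to the higher one.
    return min(rate_a, rate_b) / max(rate_a, rate_b)

rate_a = selection_rate(45, 100)  # group A selected at 0.45
rate_b = selection_rate(27, 100)  # group B selected at 0.27

ratio = disparate_impact_ratio(rate_a, rate_b)
print(round(ratio, 2))            # 0.6 -> below 0.8, flagged
```

Note this measures only outcomes; the five axes above (proxies, masking, etc.) are about how such a gap gets in despite facially neutral features.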

Remediation

  • Explanation
  • Transparency
  • Audits
  • Fairness

Who

  • Solon Barocas, self [Princeton]
    Trade: theorist.
  • Cynthia Dwork, self [Microsoft]
    Trade: pioneer [theorist].
  • Seth Flaxman, staff, Oxford University.
    Trade: expert.
  • Bryce Goodman, staff, Oxford University.
    Trade: expert.
  • Cathy O’Neil, self.
    Trade: data scientist (statistician) who works on a Macintosh computer and lives in San Francisco.
  • Frank Pasquale, professor, law [Maryland]
    Trade: educator.
  • Andrew Selbst, self [U.S. Court of Appeals]
    Trade: theorist

Referenced

Hidden Technical Debt in Machine Learning Systems | Sculley, Holt, Golovin, Davydov, Phillips, Ebner, Chaudhary, Young, Crespo, Dennison

D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-François Crespo, Dan Dennison; Hidden Technical Debt in Machine Learning Systems; technical report; 2015; 9 pages; earlier in Proceedings of the Workshop, Software Engineering for Machine Learning (SE4ML), at the Conference on Neural Information Processing Systems (NIPS), 2014.

Abstract

Machine learning offers a fantastically powerful toolkit for building useful complex prediction systems quickly. This paper argues it is dangerous to think of these quick wins as coming for free. Using the software engineering framework of technical debt, we find it is common to incur massive ongoing maintenance costs in real-world ML systems. We explore several ML-specific risk factors to account for in system design. These include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, configuration issues, changes in the external world, and a variety of system-level anti-patterns.

Previously

D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young (Google); Machine Learning: The High-Interest Credit Card of Technical Debt; in Proceedings of the Workshop, Software Engineering for Machine Learning (SE4ML), at the Conference on Neural Information Processing Systems (NIPS), 2014; 9 pages; separately filled.

How Social Bias Creeps Into Web Technology | WSJ

How Social Bias Creeps Into Web Technology; Elizabeth Dwoskin; In The Wall Street Journal (WSJ); 2015-08-20.

Mentions

  • Offenses (exemplars)
    • Embarrassment over machine-learning categorization failures.
      Photo labeling failures:

      • African-ancestry person labeled as “apes”
      • African-ancestry person labeled as “gorilla”
      • Concentration camp labeled as “jungle gym”

      Ad targeting

      • Different ads shown to boys and girls.
      • The boys were shown “better” ads.
  • Perpetrators
    • Flickr, Yahoo, 2015-05.
    • Google, 2015-06.
  • Activists
    • Carnegie Mellon University
    • Andrew Selbst
    • Solon Barocas
    • Web Transparency and Accountability Project, Princeton University
  • Solon Barocas (Princeton University), Andrew D. Selbst (U.S. Court of Appeals); Big Data’s Disparate Impact; California Law Review, Vol. 104, 2016 (to appear); 62 pages; SSRN; separately filled.
  • Quoted
    for color, background & verisimilitude
    • Vivienne Ming, CTO, Guild Inc.; expert.
    • Adeyemi Ajao, vice president of technology strategy, Workday.
    • Andrew Selbst, U.S. Court of Appeals Third Circuit.
    • Paul Viola, ex-staff, Massachusetts Institute of Technology (MIT).
    • T.M. Ravi, co-founder and director, Hive (an incubator)
  • Remediations, vignette
    • <quote>Xerox Corp., for example, quit looking at job applicants’ commuting time even though software showed that customer-service employees with the shortest commutes were likely to keep their jobs at Xerox longer. Xerox managers ultimately decided that the information could put applicants from minority neighborhoods at a disadvantage in the hiring process.</quote>, uncited.

Machine Learning: The High-Interest Credit Card of Technical Debt | Sculley, Holt, Golovin, Davydov, Phillips, Ebner, Chaudhary, Young

D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young (Google); Machine Learning: The High-Interest Credit Card of Technical Debt; in Proceedings of the Workshop, Software Engineering for Machine Learning (SE4ML), at the Conference on Neural Information Processing Systems (NIPS); 2014; 9 pages.

Abstract

Machine learning offers a fantastically powerful toolkit for building complex systems quickly. This paper argues that it is dangerous to think of these quick wins as coming for free. Using the framework of technical debt, we note that it is remarkably easy to incur massive ongoing maintenance costs at the system level when applying machine learning. The goal of this paper is to highlight several machine learning specific risk factors and design patterns to be avoided or refactored where possible. These include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, changes in the external world, and a variety of system-level anti-patterns.

References

  • R. Ananthanarayanan, V. Basker, S. Das, A. Gupta, H. Jiang, T. Qiu, A. Reznichenko, D. Ryabkov, M. Singh, S. Venkataraman. Photon: Fault-tolerant and scalable joining of continuous data streams. In Proceedings of the 2013 International Conference on Management of Data (SIGMOD); 2013; pages 577–588. landing, ACM (paywall).
  • L. Bottou, J. Peters, J. Quiñonero Candela, D. X. Charles, D. M. Chickering, E. Portugaly, D. Ray, P. Simard, E. Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. In Journal of Machine Learning Research. 2013-11-14., 14(Nov), 2013. arXiv, YouTube.
  • William J. Brown, Ralph C. Malveau, Hays W. (“Skip”) McCormick, Thomas J. Mowbray, R. C. Malveau. Antipatterns: refactoring software, architectures, and projects in crisis.; Wiley. 1998. Amazon.
  • Martin Fowler, Kent Beck, John Brant, William Opdyke, Don Roberts. Refactoring: improving the design of existing code. Pearson Education India, Addison-Wesley. 1999. 337 pages. Amazon.
  • A. Lavoie, M. E. Otey, N. Ratliff, D. Sculley. History dependent domain adaptation. In Proceedings of the Domain Adaptation Workshop held at the Neural Information Processing Systems Conference (NIPS); 2011.
  • H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin, S. Chikkerur, D. Liu, M. Wattenberg, A. M. Hrafnkelsson, T. Boulos, J. Kubica. Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD); 2013-08-11.
  • J. D. Morgenthaler, M. Gridnev, R. Sauciuc, S. Bhansali. Searching for build debt: Experiences managing technical debt at Google. In Proceedings of the Third International Workshop on Managing Technical Debt, 2012.
  • D. Sculley, M. E. Otey, M. Pohl, B. Spitznagel, J. Hainsworth, Y. Zhou. Detecting adversarial advertisements in the wild. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 2011. landing, ACM (paywall).
  • Securities and Exchange Commission (SEC). SEC Charges Knight Capital With Violations of Market Access Rule; 2013.
  • A. Spector, P. Norvig, S. Petrov. Google’s hybrid approach to research. In Communications of the ACM, Volume 55 Issue 7, 2012.

18 Articles on Big Data, 20 Tutorials at Data Science Central

38 Seminal Articles Every Data Scientist Should Read; Amy; In Data Science Central; 2014-08-15.

The listicle

  1. Bigtable: A Distributed Storage System for Structured Data
  2. Pedro Domingos; A Few Useful Things to Know about Machine Learning; In Communications of the ACM; 2012; 9 pages.
  3. Leo Breiman; Random Forests; In WHERE?; 2001-01; 33 pages.
  4. E. F. Codd; A Relational Model of Data for Large Shared Data Banks; In Communications of the ACM; Volume 13, Number 6; 1970-06; 11 pages.
  5. Cheng-Tao Chu, Sang Kyun Kim, YuanYuan Yu, Gary Bradski, Yi-An Lin, Andrew Y. Ng, Kunle Olukotun; Map-Reduce for Machine Learning on Multicore; In Proceedings of NIPS (WHERE?); 2006; 8 pages.
  6. Leo Breiman; Pasting Small Votes for Classification in Large Databases and On-Line; In Machine Learning; Volume 36, Issue 1-2; 1999-07; pages 85–103; paywalled
  7. Someone; Recommendations Item-to-Item Collaborative Filtering; In Proceedings of Some Conference; WHEN?
  8. Recursive Deep Models for Semantic Compositionality Over a Sentimen…
  9. Spanner: Google’s Globally-Distributed Database
  10. Megastore: Providing Scalable, Highly Available Storage for Interac…
  11. F1: A Distributed SQL Database That Scales
  12. APACHE DRILL: Interactive Ad-Hoc Analysis at Scale
  13. A New Approach to Linear Filtering and Prediction Problems
  14. Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang,
    Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu,
    Zhi-Hua Zhou, Michael Steinbach, David J. Hand, Dan Steinberg; Top 10 Algorithms on Data Mining; In Knowledge Information Systems; 2007-12-04, 2008; 37 pages; previously filled.
  15. The PageRank Citation Ranking: Bringing Order to the Web
  16. MapReduce: Simplified Data Processing on Large Clusters
  17. The Google File System
  18. Amazon’s Dynamo

Data Science Central Tutorials

  1. How to detect spurious correlations, and how to find the …
  2. Automated Data Science: Confidence Intervals
  3. 16 analytic disciplines compared to data science
  4. From the trenches: 360-degree data science
  5. 10 types of regressions. Which one to use?
  6. Practical illustration of Map-Reduce (Hadoop-style), on real data
  7. Jackknife logistic and linear regression for clustering and predict…
  8. A synthetic variance designed for Hadoop and big data
  9. Fast Combinatorial Feature Selection with New Definition of Predict…
  10. Internet topology mapping
  11. 11 Features any database, SQL or NoSQL, should have
  12. 10 Features all Dashboards Should Have
  13. Clustering idea for very large datasets
  14. Hidden decision trees revisited
  15. Correlation and R-Squared for Big Data
  16. What Map Reduce can’t do
  17. Excel for Big Data
  18. Fast clustering algorithms for massive datasets
  19. The curse of big data
  20. Interesting Data Science Application: Steganography

Big Data: New Tricks for Econometrics | Hal Varian

Hal R. Varian; Big Data: New Tricks for Econometrics; 2014-02-02; 36 pages.

Abstract

Nowadays computers are in the middle of most economic transactions. These “computer-mediated transactions” generate huge amounts of data, and new tools can be used to manipulate and analyze this data. This essay offers a brief introduction to some of these tools and methods.