tl;dr → There be dragons. Princeton is there. Tell it! Testify!
When you browse the web, hidden “third parties” collect a large amount of data about your behavior. This data feeds algorithms to target ads to you, tailor your news recommendations, and sometimes vary prices of online products. The network of trackers comprises hundreds of entities, but consumers have little awareness of its pervasiveness and sophistication. This chapter discusses the findings and experiences of the Princeton Web Transparency Project, which continually monitors the web to uncover what user data companies collect, how they collect it, and what they do with it. We do this via a largely automated monthly “census” of the top 1 million websites, in effect “tracking the trackers”. Our tools and findings have proven useful to regulators and investigative journalists, and have led to greater public awareness, the cessation of some privacy-infringing practices, and the creation of new consumer privacy tools. But the work raises many new questions. For example, should we hold websites accountable for the privacy breaches caused by third parties? The chapter concludes with a discussion of such tricky issues and makes recommendations for public policy and regulation of privacy.
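The census's core classification step — deciding which requests observed during a page load are third-party — can be sketched roughly as below. The URLs and the "last two host labels" heuristic are illustrative assumptions; the actual measurement uses an instrumented browser and a public-suffix-aware notion of a site, not this simplification.

```python
from urllib.parse import urlparse

def third_party_domains(first_party_url, request_urls):
    """Classify requests as first- or third-party by comparing a crude
    'site' (last two host labels -- a stand-in for a Public Suffix List
    lookup) against the site of the page being visited."""
    def site(url):
        host = urlparse(url).hostname or ""
        return ".".join(host.split(".")[-2:])
    first = site(first_party_url)
    return sorted({site(u) for u in request_urls if site(u) != first})

# Hypothetical requests observed while loading a page.
reqs = [
    "https://www.example.com/index.html",
    "https://cdn.example.com/app.js",
    "https://tracker.adnetwork.com/pixel.gif",
    "https://stats.analytics.net/collect",
]
print(third_party_domains("https://www.example.com/", reqs))
# → ['adnetwork.com', 'analytics.net']
```

Aggregating this classification over a million sites per month is what lets the project rank trackers by prevalence.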
Levin S (2016) A beauty contest was judged by AI and the robots didn’t like dark skin. The Guardian
Solove DJ (2001) Privacy and power: Computer databases and metaphors for information privacy. Stanford Law Review pp 1393–1462
Marthews A, Tucker C (2015) Government surveillance and internet search behavior. ssrn:2412564
Hannak A, Soeller G, Lazer D, Mislove A, Wilson C (2014) Measuring price discrimination and steering on e-commerce web sites. In: Proceedings of the 2014 Conference on Internet Measurement Conference, ACM, pp 305–318
Calo R (2013) Digital market manipulation. University of Washington School of Law Research Paper 2013-27 DOI 10.2139/ssrn.2309703 ssrn:2309703
Mayer JR, Mitchell JC (2012) Third-party web tracking: Policy and technology. In: Proceedings of the 2012 IEEE Symposium on Security and Privacy, IEEE, pp 413–427
Lerner A, Simpson AK, Kohno T, Roesner F (2016) Internet jones and the raiders of the lost trackers: An archaeological study of web tracking from 1996 to 2016. In: Proceedings of the 25th USENIX Security Symposium (USENIX Security 16)
Laperdrix P, Rudametkin W, Baudry B (2016) Beauty and the beast: Diverting modern web browsers to build unique browser fingerprints. In: Proceedings of the 37th IEEE Symposium on Security and Privacy (S&P 2016)
Eckersley P (2010) How unique is your web browser? In: International Symposium on Privacy Enhancing Technologies Symposium, Springer, pp 1–18
Acar G, Juarez M, Nikiforakis N, Diaz C, Gürses S, Piessens F, Preneel B (2013) Fpdetective: dusting the web for fingerprinters. In: Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, ACM, pp 1129–1140
Englehardt S, Narayanan A (2016) Online tracking: A 1-million-site measurement and analysis. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer & Communications Security
Acar G, Eubank C, Englehardt S, Juarez M, Narayanan A, Diaz C (2014) The web never forgets. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (CCS ’14) DOI 10.1145/2660267.2660347
Mowery K, Shacham H (2012) Pixel perfect: Fingerprinting canvas in HTML5. In Proceedings of W2SP
Vasilyev V (Valve) (2016) FingerprintJS2 — modern & flexible browser fingerprinting library, a successor to the original fingerprintjs.
Olejnik Ł, Acar G, Castelluccia C, Diaz C (2015) The leaking battery. In: International Workshop on Data Privacy Management, Springer, pp 254–263
Soltani A, Peterson A, Gellman B (2013) NSA uses Google cookies to pinpoint targets for hacking. The Washington Post
Englehardt S, Reisman D, Eubank C, Zimmerman P, Mayer J, Narayanan A, Felten EW (2015) Cookies that give you away. In Proceedings of the 24th International Conference on World Wide Web (WWW ’15) DOI 10.1145/2736277.2741679
Angwin J (2016) Google has quietly dropped ban on personally identifiable web tracking. ProPublica
Angwin J (2014) Why online tracking is getting creepier. ProPublica
Vallina-Rodriguez N, Sundaresan S, Kreibich C, Paxson V (2015) Header enrichment or ISP enrichment? In Proceedings of the 2015 ACM SIGCOMM Workshop on Hot Topics in Middleboxes and Network Function Virtualization (HotMiddlebox ’15) DOI 10.1145/2785989.2786002
Vanrykel E, Acar G, Herrmann M, Diaz C (2016) Leaky birds: Exploiting mobile application traffic for surveillance. In Proceedings of Financial Cryptography and Data Security 2016
Lécuyer M, Ducoffe G, Lan F, Papancea A, Petsios T, Spahn R, Chaintreau A, Geambasu R (2014) XRay: Enhancing the web’s transparency with differential correlation. In: Proceedings of the 23rd USENIX Security Symposium (USENIX Security 14), pp 49–64
Lecuyer M, Spahn R, Spiliopolous Y, Chaintreau A, Geambasu R, Hsu D (2015) Sunlight: Fine-grained targeting detection at scale with statistical confidence. In: Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, ACM, pp 554–566
Tschantz MC, Datta A, Datta A, Wing JM (2015) A methodology for information flow experiments. In: Proceedings of the 2015 IEEE 28th Computer Security Foundations Symposium, IEEE, pp 554–568
Datta A, Sen S, Zick Y (2016) Algorithmic transparency via quantitative input influence. In: Proceedings of 37th IEEE Symposium on Security and Privacy
Chen L, Mislove A, Wilson C (2015) Peeking beneath the hood of uber. In: Proceedings of the 2015 ACM Conference on Internet Measurement Conference, ACM, pp 495–508
Valentino-Devries J, Singer-Vine J, Soltani A (2012) Websites vary prices, deals based on users’ information. In The Wall Street Journal 10:60–68
Rastogi V, Chen Y, Enck W (2013) Appsplayground: automatic security analysis of smartphone applications. In: Proceedings of the third ACM Conference on Data and Application Security and Privacy, ACM, pp 209–220
Enck W, Gilbert P, Han S, Tendulkar V, Chun BG, Cox LP, Jung J, McDaniel P, Sheth AN (2014) Taintdroid: an information-flow tracking system for realtime privacy monitoring on smartphones. In ACM Transactions on Computer Systems (TOCS) 32(2):5
Ren J, Rao A, Lindorfer M, Legout A, Choffnes D (2015) Recon: Revealing and controlling privacy leaks in mobile network traffic. arXiv:1507.00255.
Razaghpanah A, Vallina-Rodriguez N, Sundaresan S, Kreibich C, Gill P, Allman M, Paxson V (2015) Haystack: in situ mobile traffic analysis in user space. arXiv:1510.01419.
Sweeney L (2013) Discrimination in online ad delivery. Queue 11(3):10
Caliskan-Islam A, Bryson J, Narayanan A (2016) Semantics derived automatically from language corpora necessarily contain human biases. arXiv:1608.07187.
INFO 198/BIOL 106B (callingbullshit.org) – “Calling Bullshit in the Age of Big Data,” a course, University of Washington (Washington State, that is, located in Seattle WA). Instructors: Jevin West (information), Carl Bergstrom (biology)
Anthony J. Casey, Anthony Niblett; The Death of Rules and Standards; Coase-Sandor Working Paper Series in Law and Economics No. 738; Law School, University of Chicago; 2015; 58 pages; landing, copy, ssrn:2693826, draft.
Scholars have examined the lawmakers’ choice between rules and standards for decades. This paper, however, explores the possibility of a new form of law that renders that choice unnecessary. Advances in technology (such as big data and artificial intelligence) will give rise to this new form – the micro-directive – which will provide the benefits of both rules and standards without the costs of either.
Lawmakers will be able to use predictive and communication technologies to enact complex legislative goals that are translated by machines into a vast catalog of simple commands for all possible scenarios. When an individual citizen faces a legal choice, the machine will select from the catalog and communicate to that individual the precise context-specific command (the micro-directive) necessary for compliance. In this way, law will be able to adapt to a wide array of situations and direct precise citizen behavior without further legislative or judicial action. A micro-directive, like a rule, provides a clear instruction to a citizen on how to comply with the law. But, like a standard, a micro-directive is tailored to and adapts to each and every context.
While predictive technologies such as big data have already introduced a trend toward personalized default rules, in this paper we suggest that this is only a small part of a larger trend toward context- specific laws that can adapt to any situation. As that trend continues, the fundamental cost trade-off between rules and standards will disappear, changing the way society structures and thinks about law.
tl;dr → Sort of. Given viewership data from the 24 hrs prior to the election, the model is 80% accurate in swing states. It took a year (post-election) to establish that such a model even existed for 2012.
The days of surprise about actual election outcomes in the big data world are likely to be fewer in the years ahead, at least to those who may have access to such data. In this paper we highlight the potential for forecasting the United States presidential election outcomes at the state and county levels based solely on the data about viewership of television programs. A key consideration for relevance is that given the infrequent nature of elections, such models are useful only if they can be trained using recent data on viewership. However, the target variable (election outcome) is usually not known until the election is over. Related to this, we show here that such models may be trained with the television viewership data in the “safe” states (the ones where the outcome can be assumed even in the days preceding elections) to potentially forecast the outcomes in the swing states. In addition to their potential to forecast, these models could also help campaigns target programs for advertisements. Nearly two billion dollars were spent on television advertising in the 2012 presidential race, suggesting potential for big data–driven optimization of campaign spending.
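The train-on-safe-states, predict-swing-states idea can be sketched as below. The feature vectors (program viewership shares), state names, and the nearest-centroid classifier are all illustrative assumptions — the paper's actual model is not specified here — but the data flow matches the abstract: labels are only assumed known for safe states, and swing states are scored against those.

```python
import math

# Hypothetical per-state viewership shares for three TV programs, plus
# outcomes for "safe" states, where the result can be assumed pre-election.
safe_states = {
    "StateA": ([0.30, 0.10, 0.05], "R"),
    "StateB": ([0.28, 0.12, 0.06], "R"),
    "StateC": ([0.10, 0.30, 0.20], "D"),
    "StateD": ([0.12, 0.28, 0.22], "D"),
}

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def predict(features):
    """Nearest-centroid stand-in for the paper's model: label a swing
    state by whichever party's safe-state viewership profile it most
    resembles."""
    cents = {label: centroid([f for f, y in safe_states.values() if y == label])
             for label in ("R", "D")}
    return min(cents, key=lambda lb: math.dist(features, cents[lb]))

# A swing state whose viewership looks more like the "D" safe states.
print(predict([0.14, 0.26, 0.18]))  # → 'D'
```

The same fitted profiles double as a targeting tool: programs whose viewership separates the two centroids most are the ones worth advertising on.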
tl;dr → alternative scoring products, propensity scoring, (not-)credit reports.
For color, background & verisimilitude
Douglas Merrill, founder and chief executive, ZestFinance
Asim Khwaja, professor of international finance and development, Kennedy School, Harvard University.
Chi Chi Wu, attorney, National Consumer Law Center.
Teresa Jackson, vice president of credit, Social Finance (SoFi).
Alfonso Brigham, exemplar; customer of Social Finance (SoFi).
Phil Marleau, CEO, IOU Financial.
Eric Haller, executive vice president, Experian Data Labs.
Basix, a lender
a holding company
owns & operates Basix
Douglas Merrill, founder and chief executive
“All data is credit data”
<quote>ZestFinance collects thousands of pieces of consumer information — some submitted in an online application, some obtained from data brokers — and runs them through algorithms that judge how likely it is a borrower will repay.</quote>
founder and chief executive, ZestFinance
ex-Google, role unspecified.
ex-Rand Corp, a research role.
Social Finance (SoFi)
Social Finance (SoFi)
San Francisco, CA
a lender (a loan broker?)
backgrounds in finance, software and business consulting.
Teresa Jackson, vice president of credit, SoFi.
$1B (with a ‘b’)
does not monitor social media
bachelor’s degree, business administration, USC 2005.
has a job
acquired for a mortgage
one-bedroom condo in Nob Hill, San Francisco
publicly traded (where?)
a lender (a loan broker?)
monitor social media
count & correlate bad reviews
Phil Marleau, CEO
Experian Data Labs, “a research unit”
San Diego, CA
Eric Haller, executive vice president, Experian Data Labs
monitor social media
<quote>The firm’s data scientists took business credit information and combined it with information from Twitter, Facebook, Yelp and others. Based on that analysis, the firm is working on a credit-scoring system that could be based solely on social media information.</quote>
Solon Barocas (Princeton University), Andrew D. Selbst (U.S. Court of Appeals); Big Data’s Disparate Impact; California Law Review, Vol. 104, 2016 (to appear); 62 pages; ssrn:2477899; 2015-08-14.
Big data claims to be neutral. It isn’t.
Advocates of algorithmic techniques like data mining argue that they eliminate human biases from the decision-making process. But an algorithm is only as good as the data it works with. Data mining can inherit the prejudices of prior decision-makers or reflect the widespread biases that persist in society at large. Often, the “patterns” it discovers are simply preexisting societal patterns of inequality and exclusion. Unthinking reliance on data mining can deny members of vulnerable groups full participation in society. Worse still, because the resulting discrimination is almost always an unintentional emergent property of the algorithm’s use rather than a conscious choice by its programmers, it can be unusually hard to identify the source of the problem or to explain it to a court.
This Article examines these concerns through the lens of American anti-discrimination law — more particularly, through Title VII’s prohibition on discrimination in employment. In the absence of a demonstrable intent to discriminate, the best doctrinal hope for data mining’s victims would seem to lie in disparate impact doctrine. Case law and the EEOC’s Uniform Guidelines, though, hold that a practice can be justified as a business necessity where its outcomes are predictive of future employment outcomes, and data mining is specifically designed to find such statistical correlations. As a result, Title VII would appear to bless its use, even though the correlations it discovers will often reflect historic patterns of prejudice, others’ discrimination against members of vulnerable groups, or flaws in the underlying data.
Addressing the sources of this unintentional discrimination and remedying the corresponding deficiencies in the law will be difficult technically, difficult legally, and difficult politically. There are a number of practical limits to what can be accomplished computationally. For example, where the discrimination occurs because the data being mined is itself a result of past intentional discrimination, there is frequently no obvious method to adjust historical data to rid it of this taint. Corrective measures that alter the results of the data mining after it is complete would tread on legally and politically disputed terrain. These challenges for reform throw into stark relief the tension between the two major theories underlying anti-discrimination law: nondiscrimination and anti-subordination. Finding a solution to big data’s disparate impact will require more than best efforts to stamp out prejudice and bias; it will require wholesale reexamination of the meanings of “discrimination” and “fairness.”
Table of Contents
HOW DATA MINING DISCRIMINATES
Defining the “Target Variable” and “Class Labels”
TITLE VII LIABILITY FOR DISCRIMINATORY DATA MINING