- The role of artificial intelligence (AI) in the job market and in hiring has been expanding rapidly, with algorithmic hiring systems proliferating.
- A survey found that 55% of human resources leaders in the U.S. use predictive algorithms in hiring.
- While some AI systems present opportunities to reduce systemic biases, others create new modes of discrimination.
- Algorithmic audits have been proposed as one way to hold these systems to meaningful standards.
- Alex Engler from The Brookings Institution explains how they work.
Algorithmic hiring systems are proliferating, and while some present opportunities to reduce systemic biases, others create new modes of discrimination. Broadly, the use of algorithms can enable fairer employment processes, but this is not guaranteed to be the case without meaningful standards. Algorithmic audits have been proposed as one such way to ensure those standards, and early examples of audits of algorithmic hiring systems have been released to the public. An examination of these audits and the relevant incentives demonstrates how algorithmic auditing will not produce accountability on its own. This paper presents steps toward specific standards of what constitutes an algorithmic audit and a path to enforce those with regulatory oversight.
The role of artificial intelligence (AI) in the job market and in hiring has been expanding rapidly—an industry survey found that 55% of human resources leaders in the United States use predictive algorithms in hiring. There are now algorithmic tools available for almost every stage of the employment process. Candidates can find job openings and recruiters can find candidates through sourcing platforms such as LinkedIn, Monster, and Indeed. Many companies employ algorithmic systems to analyze resumes, while another set of companies, such as JobScan and VMock, use algorithms to improve how resumes appear to other algorithms. Other companies, such as Pymetrics and PredictiveHire, create specialized questionnaires and assessments as inputs to AI to predict job performance. Firms may use AI to transcribe recorded statements to text, then analyze those textual responses with natural language processing. Some vendors, such as HireVue and TalView, have used facial analysis in these interviews—a task that goes far beyond the limits of what AI can do. Beyond these vendors, larger companies also build their own internal algorithmic tools to aid in hiring. Even on-the-job employee surveillance, which often uses data to partially determine retention, salaries, and promotion, can be considered part of this algorithmic funnel.
The common characteristic of these algorithmic tools is that they use data collected about a candidate to infer how well they might perform in a job, often using a subfield of AI called supervised machine learning. Companies build or procure these tools in order to reduce the duration and cost of hiring, as well as potentially improve workplace diversity and new hire performance, according to a thorough review of discrimination concerns in algorithmic employment by Upturn. For individual job candidates, the effects are not as obvious, and those who don’t fit neatly into algorithmic expectations could be left out. Further, the proliferation of algorithms across the hiring process is unprecedented, engendering concerns about how these systems are affecting the labor economy—especially around systemic discrimination.
Research has suggested there are many ways for bias to enter into algorithmic hiring processes. Machine learning models that predict metrics of workplace success, such as performance reviews or salary, may attribute good marks to competency when they are in part the result of environmental factors unrelated to skill, such as historical workplace biases.
Models used to analyze natural language, such as in resumes or transcribed from interviews, have demonstrated biases against women and people with disabilities. Speech recognition models have demonstrated clear biases against African Americans and potential problems across dialectical and regional variations of speech. Commercial AI facial analysis, aside from being largely pseudoscientific, has shown clear disparities across skin color and is highly concerning for people with disabilities. Algorithms that disseminate job postings can unintentionally result in biased outcomes against young women for STEM jobs and similarly ageism against older candidates. The many documented instances of algorithmic biases are especially problematic in light of the long series of algorithmic evaluations detailed above—small biases in individual algorithms could easily accumulate to larger structural issues.
As demonstrated by an impactful paper that switched white-sounding names and African American-sounding names on otherwise identical resumes, humans also exhibit significant biases in hiring. In 2017, a meta-analysis of 28 studies observed “no change in the level of hiring discrimination against African Americans over the past 25 years” and a modest decline in discrimination against Latinos. Unfortunately, a thorough meta-analysis of psychological research shows that it is very difficult to mitigate implicit biases in human decision-making. Research also shows how diversity and inclusion interventions have not improved outcomes, such as for women in the technology industry. Other candidate evaluation tools currently in use, such as cognitive ability tests, can also exhibit dramatic racial biases resulting in disparate impact. Cognitive ability tests, including IQ tests, have played an extensive role in hiring processes for decades, despite significant evidence of their bias and relatively weak evidence of their efficacy. It is therefore critical to consider that hiring processes before algorithmic tools also exhibit systemic biases and should not necessarily be considered preferable.
Broadly speaking, using algorithms can, under certain circumstances and with careful implementation, reduce the prevalence of bias in a decision-making process. Bias can be reduced at technical levels within language models or by changing the outcome variable that is being predicted. Alternatively, changing how algorithmically generated rankings are used can enable fair outcomes from an otherwise biased model. Specifically in hiring, algorithms have also been used to aid in debiasing specific aspects of the employment process, such as helping to mask race, gender, and demographic information from applications and rewording job descriptions to attract more diverse candidates.
The 2019 paper “Discrimination in the Age of Algorithms” makes the argument for algorithms most holistically, concluding correctly that algorithms can be more transparent than human decision-making, and thus “have the potential to make important strides in combating discrimination.” The prevailing evidence supports this conclusion, although it needs more evaluation in applied settings. For instance, there have not yet been systemic analyses of the effects of employment algorithms in practice, due to their proprietary nature. Still, the deeply flawed status quo and potential capacity of algorithmic approaches to reduce bias warrant consideration of what sociotechnical systems and incentives would lead to the best outcomes.
It is important to consider how market incentives and governmental oversight affect algorithm development, because it is not obvious that best practices will otherwise prevail. In the absence of oversight or threat of litigation, there are good reasons to be skeptical that these employment models are typically rigorous in their approach to fairness. Most prominent of these reasons is that it is more expensive to take a thorough approach to fairness. It is time-consuming to task highly skilled data scientists and engineers with making robust and fair algorithmic processes, rather than building new features or delivering a model to a client. Further, it adds expense to collect more diverse and representative data to use in developing these models before deploying them. The scale at which these systems can operate is also a concern—these employment algorithms cumulatively affect tens of millions of people. This means that discriminatory outcomes can easily harm many thousands or even millions of people. Client companies, who procure these algorithmic systems to hire employees, may often be genuinely interested in fostering diversity through algorithmic hiring systems. However, it can be difficult to differentiate between the vendors of these systems in terms of non-discrimination, because all of them make strong claims about “unbiased models.” Algorithmic audits are one promising way to verify these claims about non-discrimination.
Algorithmic audits in employment
There is a growing interest in the role of algorithmic audits in response to concerns about bias in algorithmic systems. In an algorithmic audit, an independent party may evaluate an algorithmic system for bias, but also accuracy, robustness, interpretability, privacy characteristics, and other unintended consequences. The audit would identify problems and suggest improvements or alternatives to the developers. Beyond improving systems, algorithmic auditing can also help to build consumer and client trust if its results are made public. A cottage industry has arisen around this idea, including firms that specialize in algorithmic auditing, such as O’Neil Risk Consulting & Algorithmic Auditing (ORCAA) and Parity AI, and other companies focused on algorithmic monitoring, such as Fiddler, Arthur, and Weights & Biases.
While there is a long history of audits in computer security and database privacy, the idea of an algorithmic audit is relatively new and remains nebulously defined, though research is ongoing. The lack of a formal definition means there are myriad decisions that can be made within the bounds of an algorithmic audit. A vendor of algorithmic systems may have many different models in use, and thus an auditor may examine one model, many models of one type, or a random sample of some or all model types. The auditor might directly access the company’s data in order to run its own statistical tests, or it might request and take the statistics as provided by the audited company. The auditor might also directly examine the algorithmic models themselves to inspect their features and test new use cases. Further, the auditor might make its conclusions fully public, partially public, or only provide the results to the client company.
These choices decide whether the effect of an audit is private introspection, meaningful public accountability, or theater for public relations. Recent examples from the field demonstrate some of the possible outcomes.
- HireVue contracted with ORCAA to perform a highly limited audit of one of its assessments and to offer suggestions about other modeling practices. The ORCAA audit examined only HireVue’s documentation of one of its job candidate assessments. Based on those documents, the ORCAA audit determined that the assessment met a legal bar for nondiscrimination. The remainder of the ORCAA audit was a conversation with HireVue staff and external stakeholders to identify potential ways to improve HireVue’s tools. The audit did not independently analyze HireVue’s data or directly evaluate its models, but instead discussed ways to improve its approaches. As the audit notes, the audit was not representative of HireVue’s models, nor did it evaluate the types of models most likely to exhibit biased outcomes. Further, HireVue misrepresented the audit in a press release and placed the audit behind a nondisclosure agreement. Although the audit feasibly helped HireVue consider ways to improve its processes, HireVue allowed very limited public transparency or accountability.
- A recent audit done jointly by Pymetrics and independent researchers from Northeastern University is much more thorough, analyzing documentation, data, and source code provided by Pymetrics. Without informing Pymetrics as to what the examination would entail, the auditors examined seven different models, some of which were randomly selected from recent clients. The audit used representative data from employees and candidates who have taken Pymetrics assessments, and also generated synthetic data to stress test the modeling code. The audit did establish that the modeling process met a legal bar for non-discrimination but did not disclose specific bias statistics of finalized models. While proprietary source code, data, and documentation remain under a nondisclosure agreement, the auditor retained the right to report publicly about the audit, and it is being published as a paper in a leading ethical AI conference. The auditors considered independence at length, including receiving funding in the form of a grant delivered before the results of the audit. While this audit does admit important limitations, it enables a meaningful level of public transparency.
The difference between these audits in both depth of analysis and resulting transparency is noteworthy: both were substantially better in the case of Pymetrics and the Northeastern researchers. Still, there are some benefits to both of these audits. The HireVue audit created marginally more transparency than entirely private introspection and could have fostered other changes in the algorithmic processes. As an alternative, consider when Amazon discovered that its resume analysis tool was biased against women job candidates and eventually shut down the system. Amazon did not make this public, and it was only discovered a few years later by an investigative reporter at Reuters. Aside from transparency, these early algorithmic audits demonstrate how smaller vendors might evaluate and improve their algorithmic services if they do not have in-house expertise on these topics.
The incentives of algorithmic auditing
An algorithmic audit will not automatically prevent the use of biased algorithms in employment. While the idea of auditing is associated with accountability, auditing does not automatically produce accountability. When an algorithmic hiring developer contracts with an algorithmic auditor, that auditor is providing a service for a client—a service on which the auditor is often financially dependent. That financial dependency can fundamentally undermine the public value of the audit. While an auditing company may be concerned about its public reputation or holding high professional standards, the need to sell their services will be the most important factor and will ultimately decide which auditors survive and which fail. Although both ORCAA and the Northeastern University researchers were paid for their auditing work, it is relevant that the more thorough audit was done by academic researchers with other means of financial support, and the less thorough audit was done by a company with a direct financial dependency. This encourages consideration of the incentives that would lead companies to choose and enable comprehensive audits, as well as the incentives for auditors to execute robust and critical audits.
On this topic, important lessons can be taken from financial accounting, where three potential mechanisms stand out for their relevance to algorithmic auditing. These mechanisms are government enforcement that verifies those audits, market incentives that demand rigorous and thorough audits, and professional norms that hold individual auditors to a high standard. Unfortunately, these are all largely absent from algorithmic auditing, but could be enabled in ways analogous to financial accounting.
First, consider market incentives. If the clients of a financial auditing company keep releasing clean balance sheets until the moment they go bankrupt, the auditor will lose crucial trust. For example, consider the reputational damage that Ernst & Young has suffered from missing $2 billion of alleged fraud from its client, Wirecard. That Ernst & Young is taking such criticism is a good thing—it means that other financial auditing companies will take heed and avoid similar mistakes. Auditing firms work not only to verify public-facing financial statements, but also the internal controls and processes that prevent fraud. This means that shareholders and scrupulous executives have good reason to hire effective auditors, and financial auditors actively compete to excel in evaluations of auditor quality. Yet there is no parallel in algorithmic auditing, as algorithmic harms tend to be individualized and are hard to identify. Companies will not fail if their algorithms are discriminatory, and even when evidence of discrimination arises publicly, it is not clear that they face proportionate consequences.
Professional standards and codes of practice are also used to ensure rigor in professional auditing services. In financial accounting, a private sector organization, the Financial Accounting Standards Board, enforces a common set of principles, the Generally Accepted Auditing Standards (GAAS), which detail how financial accounting should be audited by an independent organization. Critically, the GAAS constrain the number of choices that can be made within an audit. While there is still some flexibility and room for potential manipulation, the standards enable a more consistent evaluation of whether audits are performed thoroughly. Again, there are no equivalent guidelines in algorithmic auditing, which explains in part the huge difference in depth between the HireVue and Pymetrics audits. In fact, if a clear set of standards were established, the HireVue analysis might not even qualify as an audit, but instead be defined as a case study or review.
The government plays a role in financial accounting oversight in two critical ways. First, accountants are liable if they knowingly participate in, or enable through negligence, financial fraud, just like the perpetrating company. Auditors who deviate from the best practices in the GAAS have a more difficult time defending their actions as diligent and in good faith if they are sued by a client, or its investors and creditors, for allowing fraud. Once again, there is no civil liability established in algorithmic auditing: if a company was cleared of discriminatory effects by an auditor but was then successfully sued for discrimination in its algorithms, the auditor would not share that liability unless it was explicitly written into a contract. This scenario is also unlikely because proving algorithmic discrimination in a lawsuit may be highly difficult, as it would often require plaintiffs to get broad access to hiring data for many job candidates, not just their own case.
Beyond civil liability, direct government oversight can be a mechanism for keeping audits honest. In finance, a non-profit called the Public Company Accounting Oversight Board (PCAOB) performs this role. In practice, however, the PCAOB has not made extensive use of its oversight powers, and its effect on the marketplace is uncertain. Tax accounting illustrates the role of government oversight more clearly: if clients of a certain tax accountant are routinely fined by the Internal Revenue Service, potential customers will know to go elsewhere. Yet there is also no clear parallel to IRS audits or PCAOB oversight in algorithmic auditing.
Unfortunately, because none of these checks exist for algorithmic hiring systems, putting illegal or unethical activity into an algorithm can effectively shield it from scrutiny. However, there are clear changes that can be made to improve this situation. Creating a specific and robust definition of what constitutes an algorithmic audit, and then enabling government oversight and civil liability as enforcement mechanisms, would all greatly improve the status quo.
Defining an algorithmic audit
Defining what should be included in algorithmic audits for biases in employment systems can encourage more rigorous future audits and provide a benchmark of comparison for completed audits. Of course, this is not a task that can be done in a single analytical paper. This task will require a community effort with a range of expert contributions and buy-in from companies in this field, for instance through a professional standards organization. Further, many considerations will depend on the particular modeling applications and how they are deployed. However, there are some important criteria that will often apply. These can act as a starting point for these standards and can help evaluate the quality of voluntary audits performed before such standards exist.
Auditing process considerations
- Auditor independence: An algorithmic audit should consciously structure contracts to maintain independence of the auditor’s results. The auditor should maintain complete independence in public reporting and dissemination of results of the audit, with only specific and narrow exceptions agreed to in advance related to proprietary information. The auditor should not disclose details of the investigatory process to the audited company. Further, payment for the audit should be delivered before the return of its results, and the auditor should have no other financial relationship with the company.
- Representative analysis: An algorithmic audit does not need to look at every deployed model for every client of an algorithmic employment vendor, but it does need to look at enough models to ensure that a fair process is being implemented. This might simply entail a random sample of all models; however, a stratified random sampling approach might also be warranted. In a stratified random sample, an auditor might take random samples of different types of models (e.g., those based on games versus open-ended questions) or different job categories (e.g., models for salespeople versus customer service representatives).
- Data, code, and model access: An algorithmic audit must have unrestricted access to some combination of the data, code, and trained models so that the auditor can directly evaluate these systems. This entails that covered companies would need to retain the deployed versions of models and relevant data for the future audit. In most cases, a representative sample of data and source code would be sufficient, as those can be used to train the models; however, also examining deployed models would be a more thorough approach. Notably, the proprietary data, code, and models do not need to be made public, as that poses privacy risks and an undue competitive disadvantage to the audited company.
- Consider adversarial actions: Auditors should generally not assume good faith by the audited companies. To the extent possible, they should examine provided data, code, and models for possible manipulation and consider other avenues to verify provided information.
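The stratified random sample described under "Representative analysis" can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed audit procedure: the model inventory, its field names (`type`, `job`), the function name `stratified_sample`, and the choice of two models per stratum are all hypothetical. The one design point it tries to show is that the auditor, not the audited company, controls the random draw.

```python
import random
from collections import defaultdict


def stratified_sample(models, stratum_of, per_stratum, seed=0):
    """Draw up to `per_stratum` models at random from each stratum.

    `stratum_of` maps a model record to its stratum label, e.g. the
    pair (assessment type, job category).
    """
    rng = random.Random(seed)  # seed chosen by the auditor so the draw is reproducible
    strata = defaultdict(list)
    for model in models:
        strata[stratum_of(model)].append(model)

    sample = {}
    for stratum, members in strata.items():
        k = min(per_stratum, len(members))  # small strata are taken whole
        sample[stratum] = rng.sample(members, k)
    return sample


# Hypothetical inventory of a vendor's deployed models (illustrative only).
inventory = [
    {
        "id": i,
        "type": "game" if i % 2 == 0 else "open_ended",
        "job": "sales" if i % 3 == 0 else "customer_service",
    }
    for i in range(24)
]

audit_sample = stratified_sample(
    inventory,
    stratum_of=lambda m: (m["type"], m["job"]),
    per_stratum=2,
)

for stratum, chosen in sorted(audit_sample.items()):
    print(stratum, [m["id"] for m in chosen])
```

Sampling within each stratum guarantees that rarer model types or job categories appear in the audit, which a simple random sample of all models would not.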
Modeling dependencies and documentation
- Data collection process: An algorithmic audit should actively consider the data collection and cleaning process, especially including any exclusion of observations, top-coding of outliers, consolidation of categorical variables, or missing data imputation and their possible effects on model training and outcomes.
- Training data representativeness: Datasets used for training machine learning models should be evaluated for representativeness to the extent that demographic data is available. For instance, in some employment algorithms it will be important for the training data not just to be diverse in applicants, but to also have many examples of successful hires from various subgroups.
- Dependency analysis: An algorithmic audit should consider the effects of, and report on, the software packages and libraries that the audited company’s algorithmic system builds on. Pretrained machine learning models used in or adapted for an algorithmic hiring system, especially those for facial analysis, voice analysis, speech transcription, language translation, and natural language processing, should be carefully considered for downstream ramifications.
- Documentation review: An algorithmic audit should review the documentation for models and ensure that the documentation accurately communicates the functionality of the models. This is especially critical in two scenarios: (1) in any situation in which a client company is receiving models or model outputs, as the client’s interpretation of the model outputs is crucial and itself can lead to biased outcomes; (2) in advertising claims and other public reporting, such as model cards or data nutrition labels.
- Candidate rankings: An algorithmic audit should thoroughly examine the job candidate rankings and associated scores produced by an algorithmic system. These scores are the most important outcome of interest because they generally determine who moves on to the next stage of a hiring process. So it is critical that an auditor evaluates how the algorithmic process generates rankings for new job applicants, examining the rate of selection between subgroups and related statistical measures, some of which are available in open-source tools such as the Audit-AI package. An auditor should be more specific than merely measuring whether the system meets legal criteria, such as the “substantially different rate of selection” standard, and should report specific statistics that are representative of the models tested.
- Performance on historical data: An algorithmic audit should also consider how trained models performed on the data used to develop the model. This differs from looking solely at candidate rankings, because model scores can be compared to who was actually hired in historical data. Audits should report on relevant fairness metrics, and can look to tools such as Aequitas and AI Fairness 360. For example, metrics of equal opportunity may be highly relevant; these statistics measure whether the model gives fair scores to successfully hired employees across different subgroups. While some models will likely not be able to perform perfectly fairly, there is still value in reporting these outcomes to enable comparisons across algorithmic systems.
- Subgroup considerations: An algorithmic audit should examine models for the relevance of variables denoting subgroups, especially protected characteristics. The audit should examine whether subgroups are included as inputs in the model or whether other included variables serve as proxies for those subgroups. Further, the audit should examine subgroups beyond the required scope of law, considering intersectionality (e.g., the overlap of gender and race) and subracial groups (e.g., disaggregating Middle Eastern from white) when possible.
- Hypothetical data: Auditors can generate synthetic data to stress test the model building process. This process could check to see if automated or human-in-the-loop oversight would catch a problematically trained model. Further, it could also help examine what might happen to individuals who fall outside the distributions of the data on which the models were trained, which may be especially true for people with disabilities.
- Problem definition: An algorithmic audit should carefully consider outcome variables chosen and how that might enable biases. Many metrics that suggest employee success can be partially driven by preferential treatment ingrained in workplace culture. For instance, the total sales of a salesperson might be more a product of the quality of leads they receive than their quality as an employee. In this case, the model’s outcome variable is misaligned with the quality of interest, which can often lead to biased outcomes, and the auditor might suggest an alternative metric.
- Frequency: The algorithmic audit should consider how models update over time and what that entails for model drift, especially the possibility of degraded performance. This consideration should lead to a recommendation on how often a modeling audit needs to occur.
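Two of the statistics named in the list above, the rate of selection between subgroups and a measure of equal opportunity, can be computed directly once an auditor has candidate-level records. The sketch below uses plain Python rather than the Audit-AI, Aequitas, or AI Fairness 360 packages, and everything in it is an assumption made for illustration: the records are hand-made, and the field names (`group`, `advanced`, `hired`) and function names are hypothetical. The ratio of each group's selection rate to the highest group's rate is the quantity behind the "four-fifths" rule of thumb in the EEOC's Uniform Guidelines.

```python
from collections import defaultdict


def rate_by_group(records, group_key, flag_key):
    """Share of records in each group for which `flag_key` is true."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        totals[r[group_key]] += 1
        if r[flag_key]:
            hits[r[group_key]] += 1
    return {g: hits[g] / totals[g] for g in totals}


def ratio_to_highest(rates):
    """Each group's rate divided by the highest group's rate."""
    top = max(rates.values())
    if top == 0:
        raise ValueError("no group has a positive rate")
    return {g: rate / top for g, rate in rates.items()}


def make_records(rows):
    """Expand (group, advanced, hired, count) rows into one record per candidate."""
    return [
        {"group": g, "advanced": adv, "hired": hired}
        for g, adv, hired, n in rows
        for _ in range(n)
    ]


# Hypothetical data. `advanced`: the model moved the candidate forward.
# `hired`: the candidate was a successful hire in the historical data.
records = make_records([
    ("A", True, True, 3), ("A", True, False, 2),
    ("A", False, True, 1), ("A", False, False, 4),
    ("B", True, True, 2), ("B", True, False, 1),
    ("B", False, True, 2), ("B", False, False, 5),
])

# Candidate rankings: selection rate per subgroup and ratio to the top group.
selection_rates = rate_by_group(records, "group", "advanced")
impact_ratios = ratio_to_highest(selection_rates)

# Equal opportunity: among successful hires only, how often did the
# model advance them? This is the true positive rate per subgroup.
hired_only = [r for r in records if r["hired"]]
true_positive_rates = rate_by_group(hired_only, "group", "advanced")

print(selection_rates, impact_ratios, true_positive_rates)
```

Reporting the rates themselves, rather than only whether a ratio clears a threshold, is what allows comparison across algorithmic systems.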
These criteria are meant as a starting point for a broader discussion. They likely neither apply in all situations, nor are fully sufficient, and additional discussion and investment is warranted. As other analyses have noted, going beyond technical implementations to a wider evaluation of stakeholder needs may also be an important role for algorithmic audits, though that is less likely to be enforced by law. Further, bias is not the only concern of employment algorithms, and audits may also consider how the models can be made more transparent to candidates or secure in their protection of sensitive data.
Implications for federal governance
Creating formal and robust standards for algorithmic audits is an important step, but it must still be backed up by governmental oversight. In the United States, the Equal Employment Opportunity Commission (EEOC) should consider what specific steps are needed to execute algorithmic audits of this kind. It appears this work was started in late 2016, but it is not clear that it continued under the Trump administration (though other relevant work has). The Department of Labor’s Office of Federal Contract Compliance could also enforce new rules, ensuring that the federal government only enters into contracts with algorithmic hiring companies that produce compelling evidence of unbiased outcomes.
As I have argued before, the Biden administration should tackle these challenges in earnest. For the EEOC, this includes exploring the limitations in agency authority to access corporate data when responding to charges of employment discrimination. In fact, 10 Senators recently wrote to the EEOC, specifically asking about its ability to investigate companies that build and deploy employment algorithms. This is a critical question, as little rigorous analysis can take place without the government gaining access to the data used for these algorithmic systems. The relevant agencies might also need new technical expertise, and it was encouraging to see the EEOC open job postings for two data scientists as part of an important effort to improve federal hiring processes.
In addition to building the capacity to do algorithmic audits, the EEOC needs to revisit existing policies and draft new guidance to adapt to the proliferation of algorithmic systems. This includes revisiting the idea of predictive validity in the EEOC’s Uniform Guidelines on Employee Selection. As Manish Raghavan and Solon Barocas have argued, the Uniform Guidelines currently enable algorithmic developers to defend against charges of discriminatory effects by demonstrating that their tool is predictive, even if it is also discriminatory. This is an unreasonable standard, because these systems typically employ supervised machine learning, which means they are predictive by definition. This provision needs to change, or little to no oversight can occur. Further, the EEOC should clarify that, when it performs an audit of algorithmic hiring systems, it will look at a broader range of metrics than just differences in rates of selection across subgroups, which have long been used as a rule of thumb for discriminatory impact. It is also necessary to consider requirements for the storage of data, code, and models by the regulated companies, so that they can be audited. The EEOC should also consider other ways to better enable civil liability as a mechanism for enforcing the fair use of algorithms.
The goal of government intervention in this field is not to make the labor market perform perfectly, but to meaningfully raise the floor of quality in algorithmic employee assessment. By holding companies to a consistent and higher standard, responsible companies that invest time and money into deployment of fair algorithms will be better able to compete, while unscrupulous companies face fines and lawsuits. Algorithmic audits alone are likely insufficient to prevent abuses, and other interventions warrant consideration. For instance, a total ban on using facial affect analysis for employment services is worth consideration. Further, these audits may not clearly apply to the online job sourcing platforms, such as Monster, LinkedIn, and Indeed. Given the importance of those companies to the labor market, working to open up their data to independent researcher access may be a valuable step.
If the government is able to create meaningful regulatory oversight, algorithmic audits would become much more impactful. This would lead to stronger incentives for investing in fairer algorithmic systems, and the world of algorithmic hiring might improve on the pervasive biases present in candidate review processes done by humans or outdated cognitive assessments. If this were to become the case, the EEOC could even highlight best practices in the field and urge employers to move toward algorithmically sound and responsible candidate selection processes. With these changes, the proliferation of algorithms in employment systems could realize its potential and significantly reduce systemic biases that have disadvantaged many individuals and undermined the broader labor market.