Data Matters—Risks and Best Practices for Use of Generative AI

The Legal Intelligencer

Businesses are racing to unlock the value of generative artificial intelligence (AI) tools. Broadly speaking, generative AI refers to AI tools that create new content like text, images, and videos from user prompts, including tools like OpenAI’s ChatGPT, Google’s Bard, and Stability AI’s Stable Diffusion. Businesses can use generative AI to build, edit, and test software code, write ad copy, and create better customer service and support experiences with enhanced chatbot capabilities. But the use and deployment of generative AI bring privacy and intellectual property risks, as well as the risk of harm to consumers resulting from inherent bias and unfair or deceptive commercial practices. As regulatory scrutiny and enforcement gain momentum and plaintiffs begin to seek redress for alleged harms caused by the use of generative AI, these risks will only grow. Accordingly, businesses should operationalize best practices to help mitigate that risk.

Privacy Issues

Training generative AI tools requires massive data sets. Common sources of data include data scraped from the internet and licensed data sets, which can include personal information.

Some AI tools allow businesses to further train the tool with their own data sets, which may also include personal information. For example, training an AI tool to enhance customer support delivery could involve inputting historical customer support journey and outcome data. This data may include customer and support personnel names, and customer account data such as contact information, demographics, and purchase information.

Using personal information to train AI models makes it challenging to comply with generally accepted privacy principles that underpin many U.S. and global privacy laws, including data minimization, purpose specification and transparency. For example, certain state laws, including the comprehensive privacy laws of California, Virginia and Colorado, require that meaningful notice be provided to consumers at or before the collection of personal information. The notice must specify the types of data being collected and the purposes for which they are being used. Notices prepared prior to deployment of an AI tool may not cover the secondary AI tool use case. Business teams also commonly take a “bigger is better” approach when selecting data sets for training AI tools. This may result in use of irrelevant data, which is contrary to a business’s obligation to collect and use only the minimum personal information required for its specific business purposes.
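As an illustration of data minimization in practice, a business might filter each training record down to an approved allowlist of fields before it enters a training data set. This is a minimal sketch only; the field names and the allowlist are hypothetical, and a real program would be driven by the business's documented purposes.

```python
# Sketch: pare a customer-support record down to an approved field
# allowlist before it is used as AI training data. Field names are
# illustrative assumptions, not a real schema.
APPROVED_FIELDS = {"issue_category", "resolution_code", "handle_time_minutes"}

def minimize(record: dict) -> dict:
    """Keep only fields approved for the stated training purpose."""
    return {k: v for k, v in record.items() if k in APPROVED_FIELDS}

raw = {"customer_name": "Jane Doe", "email": "jane@example.com",
       "issue_category": "billing", "resolution_code": "refund",
       "handle_time_minutes": 14}
print(minimize(raw))
```

The design point is that minimization happens by default at ingestion, so identifiers such as names and contact details never reach the training set unless expressly approved.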

Challenges are compounded when businesses use personal information from publicly available data or third-party licensed data to enhance data sets. Individuals to whom this personal information relates frequently do not have a direct relationship with the business and may not know their personal information is being used, making it difficult for the business to meet its transparency obligations.

Businesses may also encounter difficulty responding to individual data rights requests, which are available to residents of some U.S. states and to individuals in the EU and elsewhere. An individual’s data may be difficult to isolate and delete or modify once it has been input as training data. Moreover, engineering a rights request response process may be significantly more difficult after the AI tool is deployed than while it is being developed.

Despite these challenges, compliance with privacy obligations is imperative to protect the business’s investment in AI. The Federal Trade Commission (FTC) has used its injunctive relief powers to compel businesses both to delete personal data obtained in violation of privacy laws and to delete any algorithmic tool that may have been developed using the illegally obtained data. See FTC Stipulated Order for Permanent Injunction, Civil Penalty Judgment and Other Relief against WW International, Inc.

Intellectual Property Issues

Intellectual property risks arise with selection of AI tool input and use of its output, including potential infringement of third-party intellectual property, challenges in protecting output with traditional intellectual property rights, and loss of control of proprietary rights.

When massive data sets are scraped from the internet to train AI tools, copyrighted content will invariably be captured. For example, publicly available open source code may be used to train AI tools. Open source code is generally subject to license terms that may require, among other things, attribution to the authors and the reproduction of certain copyright and other notices. If the generative AI tool produces code in response to a business user request that includes portions of the scraped open source code but fails to include the attribution and notice, reproduction of that code may be a violation of the license terms and a copyright infringement. Similar claims have begun winding their way through the courts. See Doe 1 v. GitHub, Case No. 3:22-cv-06823 (N.D. Cal.). Unfortunately, it may be difficult or impossible for users of the generative AI tool to determine whether the output of the tool contains protected works.

Once the output is delivered, a business may find it difficult to protect the output as its own intellectual property. The U.S. Copyright Office has issued guidance stating that AI-generated content is ineligible for copyright protection, except to the extent the content has been modified by a human author. Similarly, the U.S. Court of Appeals for the Federal Circuit has ruled that AI cannot be an inventor for purposes of patent protection. See Thaler v. Vidal, 43 F.4th 1207 (Fed. Cir. 2022), cert. denied, No. 22-919 (Apr. 24, 2023).

Businesses must also manage risks related to data used as input for the AI tool. Input may often be used by the creator of the AI tool to further improve and train the tool. Competitors (or, in the case of openly available AI tools, anyone) may then have access to a tool trained on the business’s proprietary data. Proprietary information could also be accessed directly if it is provided as part of the tool’s output.

Disclosure of proprietary information may also result in ineligibility for trade secret or patent protection. Trade secrets derive protection from the fact that they are secret and that reasonable efforts are used to keep them secret. When such information must be disclosed, reasonable efforts typically include use of a confidentiality agreement. Inputting trade secret information into a generative AI tool may constitute disclosure to a third party without the requisite confidentiality guarantees, resulting in a potential loss of trade secret protection.

Similarly, input describing a patentable invention may constitute public disclosure that could result in a surrender of patent rights if appropriate filings are not submitted within a certain period after that disclosure. Without appropriate policies and controls, a business may not even be aware that the invention was disclosed to the AI tool and consequently may not timely submit filings.

Bias, Misrepresentation and Other Consumer Harms

Jurisdictions have begun regulating use of AI where bias may cause significant harms, and regulatory agencies have stated that they will scrutinize use of automated systems, including AI, that may contribute to bias and discriminatory outcomes in housing, access to credit, employment, and other areas. See New York City Local Law 144 of 2021 and the Joint Statement on Enforcement Efforts Against Discrimination and Bias in Automated Systems. All algorithms and AI tools should be regularly reviewed to identify any disparate impact on consumers, particularly disparate impact based on categories protected under applicable federal and state law.
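One common way to begin a disparate-impact review is to compare selection rates across groups and compute each group's impact ratio against the most-favored group. The sketch below is illustrative only: the data, group labels, and any threshold a business applies to a ratio are assumptions for demonstration, not a statement of what any statute or rule requires.

```python
# Sketch: per-group selection rates and impact ratios for an
# automated decision tool. Data and group labels are hypothetical.
from collections import defaultdict

def impact_ratios(decisions):
    """decisions: iterable of (group, selected) pairs.
    Returns each group's selection rate divided by the highest
    group selection rate."""
    totals = defaultdict(int)
    selected = defaultdict(int)
    for group, was_selected in decisions:
        totals[group] += 1
        if was_selected:
            selected[group] += 1
    rates = {g: selected[g] / totals[g] for g in totals}
    best = max(rates.values())
    return {g: rate / best for g, rate in rates.items()}

data = [("A", True), ("A", True), ("A", False),
        ("B", True), ("B", False), ("B", False)]
print(impact_ratios(data))  # group B is selected at half group A's rate
```

Ratios well below 1.0 for a protected group are a signal for further review by counsel and data scientists, not an automatic legal conclusion.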

Misrepresentation can occur when businesses overestimate and misrepresent the capabilities of an AI tool or disseminate output of a generative AI tool that is incorrect. Human review of results is imperative to prevent dissemination of AI output that is false or misleading in violation of the FTC Act and related state statutes.

Best Practices

Create robust governance structures: Effective governance and oversight by a multidisciplinary and diverse team helps foster responsible development, deployment, and use of AI tools. Develop governance functions to identify, understand, and measure risk relating to AI use, and to manage that risk consistent with the risk profile of the business.

Implement clear use policies: Employees should have clear guidance on what use cases are allowed and prohibited. For example, a business may prohibit AI use in areas impacting access to credit and employment without an appropriate level of review and approval. If vendor AI tools are used for these use cases, vendors should be required to show the tools have been reviewed for bias.

Limit use of personal data and sensitive business information: If possible, input should be de-identified. Use of personal data should be limited to what is minimally necessary for the business purpose. Input of information classified as sensitive or restricted by the business should be prohibited without an appropriate level of review and approval.
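As a starting point for limiting personal data in tool input, a business might programmatically redact direct identifiers before anything is submitted. The regex patterns below are illustrative assumptions only; genuine de-identification requires far more than pattern matching (names, addresses, and quasi-identifiers will slip through) and should be designed with privacy counsel.

```python
import re

# Sketch: strip a few common direct identifiers from text before it is
# used as AI-tool input. Patterns are illustrative, not comprehensive;
# note that the name "Jane" below is NOT caught by these rules.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"), "[PHONE]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
]

def redact(text: str) -> str:
    for pattern, label in PATTERNS:
        text = pattern.sub(label, text)
    return text

print(redact("Contact Jane at jane.doe@example.com or 555-123-4567."))
# → Contact Jane at [EMAIL] or [PHONE].
```

A redaction step like this works best as a mandatory gateway in front of the AI tool, paired with the review-and-approval process described above for sensitive or restricted information.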

Diligence vendors and review contractual terms: Terms should be reviewed so the business understands how the AI tool provider uses input. Businesses should consider configurations or offerings that limit secondary uses of input data, if available.

Review privacy notices: Businesses must provide accurate and complete disclosures. If AI use cases expand the collection and use of personal information, notices to individuals should be updated to reflect such use.

Ensure human oversight: All AI output should be reviewed by a human being for accuracy and suitability before it is used or disseminated.

"Data Matters—Risks and Best Practices for Use of Generative AI," by Sharon R. Klein, Alex C. Nisenbaum and Karen H. Shin was published in The Legal Intelligencer on July 2, 2023.

Reprinted with permission from the July 2, 2023, edition of The Legal Intelligencer © 2023 ALM Properties, Inc. All rights reserved. Further duplication without permission is prohibited.