Article

July 18, 2025

Author:
MIT Technology Review

AI training data set contains millions of personal records, researchers find

“A major AI training data set contains millions of examples of personal data”, July 18, 2025

Millions of images of passports, credit cards, birth certificates, and other documents with personally identifiable information are likely included in DataComp CommonPool, one of the largest open-source image generation training sets.

Thousands of images…were found in a small subset of DataComp CommonPool, a major AI training set for image generation scraped from the web. Because the researchers audited just 0.1% of CommonPool’s data, they estimate that the real number of images containing personally identifiable information, including faces and identity documents, is in the hundreds of millions…

The researchers found thousands of instances of validated identity documents—including images of credit cards, driver’s licenses, passports, and birth certificates—as well as over 800 validated job application documents (including résumés and cover letters)...

A number of the résumés disclosed sensitive information including disability status, the results of background checks, birth dates and birthplaces of dependents, and race. When résumés were linked to people with online presences, researchers also found contact information, government identifiers, sociodemographic information, face photographs, home addresses, and the contact information of other people…

…DataComp CommonPool was released in 2023 with 12.8 billion data samples... While its curators said that CommonPool...

“What we found is that ‘publicly available’ includes a lot of stuff that a lot of people might consider private... These are probably not things people want to just be used anywhere, for anything,” says Hong...