Tumblr and WordPress are reportedly close to signing a deal to sell user data to the artificial intelligence companies OpenAI and Midjourney, according to a report from 404 Media. The platforms' parent company, Automattic, is said to be nearing completion of an agreement to provide data to help the AI companies train their models.
Although it's not clear exactly what data will be included, the report suggests Automattic may have initially overstepped. An internal post by Tumblr product manager Cyle Gage indicates that Automattic was prepared to send private and partner-related data that wasn't supposed to be part of the deal. The questionable content reportedly includes private posts on public blogs, posts from deleted or suspended blogs, unanswered asks (which are therefore not public), private answers, posts marked as explicit, and content from premium partner blogs, including Apple's former music blog.
Internal posts suggest that Automattic engineers are compiling a list of post IDs that should have been excluded. It is not clear whether the data had already been sent to the AI companies.
Engadget emailed Automattic to ask for comment on the report. The company responded with a published statement, claiming it "only shares public content hosted on WordPress.com and Tumblr from sites that haven't opted out." The statement also noted that regulations do not currently require AI companies' web crawlers to honor users' opt-out settings.
The last part of Automattic's statement appears consistent with the reported deal. "We also work directly with select AI companies as long as their plans align with what our community cares about: attribution, opt-outs, and control," Automattic writes. "Our partnerships respect all opt-out settings. We also take this a step further by regularly updating our partners about people who have newly opted out, and asking that their content be removed from past sources and future training."
The company reportedly plans to release a new opt-out tool on Wednesday that it claims will allow users to block data training by third parties, including AI companies. 404 Media reviewed an internal FAQ that Automattic purportedly prepared for the tool. One of the answers reads: "If you opt out, we will prevent crawlers from accessing your content by adding your site to a disallowed list. If you change your mind later, we will update our partners about people who have newly opted out, and we also plan to share that content and request that it be removed from past sources and future training."
The operative word there may be "request": Automattic can ask AI companies to delete data, but it apparently cannot compel them to.
In an internal memo, Automattic's head of AI, Andrew Spittle, reportedly responded to staff questions about data-deletion guarantees when the tool is used: "We want to make this an ongoing process, periodically advocating for the exclusion of past content based on users' current settings and requesting that the content be removed from future training runs. Based on our conversations so far, I believe our partners will respect this. I don't think there would be much benefit for them, overall, in doing otherwise."
So if a Tumblr or WordPress user opts out of AI training, Automattic will allegedly "request" and "advocate for" the removal of their data. And the company's AI chief "believes," "based on our conversations," that it is in the AI companies' best interest to comply. (How reassuring!)
AI training-data contracts have become a lucrative opportunity for websites facing today's precarious online publishing landscape. (Tumblr's staff was reduced to a skeleton crew at the end of 2023.) Last week, Google signed a deal with Reddit, ahead of the latter's IPO, to train on the platform's vast trove of user-generated content. Meanwhile, OpenAI rolled out a partnership program last year to collect datasets from third parties to help train its AI models.
Updated, February 27, 2024, 3:56 PM ET: This article has been updated to include a published statement from Automattic, the parent company of WordPress and Tumblr.