LLMs 'Would Not Exist' Without Reddit Data

Reddit CEO Steve Huffman said large language models “wouldn’t exist as we know them” without Reddit’s content. He called the platform’s user-generated data “modern oil” for AI.

Huffman made the comments during an interview at Fast Company’s Most Innovative Companies Summit.

What Huffman Said About Reddit’s Value To AI

Huffman described the position Reddit’s data holds in the AI ecosystem.

Huffman said:

“LLMs wouldn’t exist as we know them without Reddit. Reddit is one of the single largest sources of training data for the LLMs and Reddit continues to be one of the primary sources of both training data and we’re also the most cited, the most cited platform across all models.”

He attributed the citation claim to Profound, a firm that tracks AI citation data.

Huffman explained why AI companies depend on the content.

“There’s no artificial intelligence without actual intelligence. At the end of the day, these models are pretty simple. They’re regurgitating on an absolutely massive scale what they’ve consumed elsewhere and a large portion of that consumption is actually just the human conversation on Reddit because it’s natural and it covers basically every topic imaginable.”

Deals For Some, Lawsuits For Others

Reddit announced data licensing agreements with Google and OpenAI in 2024. Huffman referenced these as Reddit’s original two AI data deals and didn’t announce any additional agreements.

“Since we did the original two deals with Google and OpenAI, that was over two years ago, so we’ve learned a lot. They’ve learned a lot. The whole world’s learned a lot. Particularly how helpful Reddit’s data is and how helpful it is. And so we’re being I think very deliberate and selective there. But yeah, we’re open and open for business.”

For companies that haven’t agreed to licensing terms, Reddit has taken legal action. The company sued Anthropic in California Superior Court, alleging unauthorized use of Reddit content and violations of Reddit’s terms. Reddit filed a federal lawsuit against Perplexity in the Southern District of New York, along with three data-scraping companies, alleging DMCA anti-circumvention violations and related claims.

Huffman drew a line between the two groups.

“Companies like Google and OpenAI where we had good relationships, we can actually do a deal and put some guard rails on use and access to our data on behalf of our users but then collaborate on making products for the next generation of the internet.”

He added that “not every company is willing to be a collaborative partner and so unfortunately we have to go the other way which is lawsuits.”

Huffman told the audience Reddit’s position on commercial use is simple. “Commercial use of our data requires commercial terms,” he said. Reddit began charging for commercial API access in 2023, a move that preceded the current licensing deals.

Huffman said Reddit still provides free data access to researchers and universities and tries to remain flexible for non-commercial use.

What Changed Reddit’s Openness

According to Huffman, Reddit’s willingness to share data freely changed when the AI industry moved away from open research. As SEJ previously reported, Reddit restricted access for many search engine crawlers while Google remained an exception.

“Historically, Reddit has been like we’re born of the open internet and Reddit has been open and very permissive for access to its data. And honestly, I think we’d be in a different place today if the AI companies were still mostly open and open source and doing open research.”

Huffman said the issue was that Reddit could no longer track how its data was being used. “People are using our data and we don’t know what it was being used for,” he told the audience.

Beyond commercial terms, Huffman said Reddit wants to prevent its data from being used to identify users, target them with ads, or to replace or disintermediate the platform.

Reddit’s Own AI Efforts

Huffman acknowledged what he called a “paradox.” Reddit’s content powers external AI systems, but the company also uses AI across its platform.

The most visible product is Reddit Answers, an LLM-powered search feature. It reads posts and comments, then organizes them into responses built from verbatim user quotes. Huffman noted it’s designed for questions without definitive answers.

“What Reddit Answers does is a couple of things that are unique to Reddit. One, it basically only answers in verbatim quotes from actual people. And then the second thing it does is it tries to present multiple perspectives because the whole point when you’re on Reddit, you want the human perspective.”

Behind the scenes, Reddit uses AI for content moderation and classification. LLMs can evaluate whether a comment crosses into bullying, something Huffman described as previously difficult because of the subjectivity involved.

Huffman presented AI moderation as a way to reduce exposure to the worst content, not as a replacement for Reddit’s community moderation model.

“The worst job on the internet was looking at the worst content on the internet and deciding whether it could be online or not,” Huffman said. “That job just goes away.”

The Gray Area Of AI-Written Posts

Huffman also addressed the challenge of users writing content with AI tools and pasting it into Reddit. That’s different from automated bot activity, he stressed.

“The most annoying thing that I see not just on Reddit, but all over the internet is somebody who wrote their post or comment with ChatGPT and then pasted it into Reddit. Like, is that a bot? Definitely looks like a bot, but there’s a human behind the thought.”

Huffman cast the issue as one of intent. “It’s important to us that there’s a human behind the thought, behind the content, behind the prompt,” Huffman said. But he also noted that “the writing sucks” when users rely on AI to compose their posts.

Rather than creating a policy to address it, Huffman indicated Reddit will let its community handle the issue. Users are already downvoting AI-written content and calling it out in comments. Huffman said Reddit will “empower the users more and the subreddits more to just reject that type of content altogether.”

He compared the broader question to calculators in math class. “Kids these days are just learning how to write with AI. What are we going to do about it?” he said. “We kind of have to learn, I think, along with everybody else.”

Why This Matters

Huffman’s comments reinforce Reddit’s pitch that its user discussions are a core input for AI systems.

The AI-written content problem Huffman described is one SEJ covered as part of a broader YouTube AI slop investigation. Reddit’s decision to let community voting handle AI-generated posts, rather than building detection tools, is a different path than platforms that have deployed automated labeling.

Looking Ahead

Huffman told Fast Company that Reddit is “in the market talking to folks all the time” about new data deals, though he didn’t hint at a third agreement.

Reddit’s lawsuits against Anthropic and Perplexity are both ongoing. The Anthropic case was the subject of a federal court remand hearing in March.
