Recent restrictions on data scraping don’t have to derail your generative AI initiatives

Businesses and developers building generative AI models got some bad news this summer. Twitter, Reddit and other social media networks announced that they would either stop providing access to their data, cap the amount of data that could be scraped or start charging for the privilege. Predictably, the news set the internet on fire, even sparking a sitewide revolt from Reddit users who protested the change. Nevertheless, the tech giants carried on and, over the past several months, have started implementing new data policies that severely restrict data mining on their sites.

Fear not, developers and data scientists. The sky is not falling, and you don’t need to hand over your corporate credit cards just yet. There are more relevant ways for organizations to empower their employees with alternative sources of data and keep their data-driven initiatives from being derailed.

The Big Data Opportunity in Generative AI

The billions of human-to-human interactions that take place on these sites have always been a gold mine for developers who need enormous datasets on which to train AI models. Without access (or without affordable access), developers would have to find another source of this type of data or risk training their models on incomplete datasets. Social media sites know what they have and are looking to cash in.

And, honestly, who can blame them? We’ve all heard the quip that data is the new oil, and generative AI’s rise is the most accurate example of that truism I’ve seen in a long time. Companies that control access to large datasets hold the key to creating the next-generation AI engines that will soon radically change the world. There are billions of dollars to be made, and Twitter, Reddit, Meta and other social media sites want their share of the pie. It’s understandable, and they have that right.

So, What Can Organizations Do Now?

Developers and engineers will have to adapt how they collect and use data in this new environment. That means finding new sources of data they control, along with new data use policies that keep that data reliable and usable over time. The good news is that most enterprises are already collecting this data. It lives in the thousands of customer interactions that occur inside their organizations every day. It’s in the reams of research data generated over years of development. It’s in the day-to-day interactions among employees and with partners as they go about their business. All of this data can and should be used to train new generative AI models.

While scraping data from across the internet provides a scale that would be impossible for a single organization to achieve on its own, general data scraping produces generic outputs. Look at ChatGPT: every answer is a mishmash of broad generalities and corporate speak that seems to say a whole lot but doesn’t actually mean anything of significance. It’s eighth-grade level at best, which won’t help most business users or their customers.

Proprietary AI models, on the other hand, are trained on more specific datasets that are relevant to their intended purpose. A tool trained on millions of legal briefs, for example, will produce much more relevant, thoughtful and worthwhile results. These models use language that customers and other stakeholders understand. They operate within the correct context of the situation. And they produce results while understanding sentiment and intent. When it comes to experience, relevant beats generic every day of the week.
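To make that concrete, here is a minimal sketch of how a team might fine-tune an off-the-shelf language model on a proprietary corpus using the Hugging Face transformers and datasets libraries. The base model (gpt2), the legal_briefs/*.txt path and the hyperparameters are all illustrative placeholders, not a recommendation; a real project would choose these to suit its own data and infrastructure.

```python
# Sketch: fine-tuning a small causal language model on an in-house
# corpus of plain-text files. Paths, model name and hyperparameters
# are hypothetical placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "gpt2"  # illustrative starting point; swap in your base model

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 defines no pad token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Load the proprietary corpus; "legal_briefs/*.txt" is a made-up path.
dataset = load_dataset("text", data_files={"train": "legal_briefs/*.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="briefs-model",
        per_device_train_batch_size=4,
        num_train_epochs=1,
    ),
    train_dataset=tokenized,
    # mlm=False makes the collator build labels for causal LM training.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The key design point is that nothing in the pipeline leaves the organization: the corpus, the fine-tuned weights and the outputs all stay under the company’s control, which is exactly the advantage the scraping restrictions can’t touch.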

However, businesses can’t just collect all the data across their organization and dump it into a data lake somewhere, never to be touched again. More than 100 zettabytes (yes, that’s zettabytes with a z) of data were created worldwide in 2022, and that number is expected to keep exploding over the next several years. You’d think that this volume of data would be more than enough to train virtually any generative AI model. Yet a recent Salesforce survey found that 41% of business leaders say they struggle to make use of their data because it is too complex or not accessible enough. Clearly, volume is not the issue. What matters is putting the data into the right context, sorting and labeling the relevant information, and making sure developers and other priority users have the right access.
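As a toy illustration of that idea, the sketch below (standard-library Python only) attaches context metadata to each record and lets consumers pull only the slice their role and use case permit. The field names, roles and access rules are invented for the example.

```python
# Toy illustration: records carry context metadata, and a training
# slice is filtered by both access rights and business domain.
from dataclasses import dataclass

@dataclass
class Record:
    text: str
    source: str       # e.g. "support_tickets", "research_notes"
    domain: str       # business-context label, e.g. "legal", "billing"
    sensitivity: str  # "public", "internal" or "restricted"

# Invented role-based access rules for the example.
ACCESS = {
    "ml_engineer": {"public", "internal"},
    "contractor": {"public"},
}

def training_slice(records, role, domain):
    """Return only the records this role may use for the given domain."""
    allowed = ACCESS.get(role, set())
    return [r for r in records if r.sensitivity in allowed and r.domain == domain]

corpus = [
    Record("How do I reset my password?", "support_tickets", "billing", "public"),
    Record("Draft NDA clause v3 ...", "research_notes", "legal", "restricted"),
]
print(len(training_slice(corpus, "ml_engineer", "billing")))  # -> 1
```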

In the past, data storage policies were written by lawyers seeking to limit regulatory and audit risk; rules governed where data had to be stored and for how long. Now, organizations need to amend those policies to make the right data more accessible and consumable. Data policies need to be modernized to dictate how data should be used and reused, how long it needs to be kept, and how to manage redundant data (copies, for example) that could skew results.
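One of those policy points, managing redundant copies, is easy to picture in code. The sketch below does exact-duplicate removal by hashing normalized text; it is a simplified illustration, and a production pipeline would likely add near-duplicate detection (MinHash or similar) on top.

```python
# Sketch: drop redundant copies before training so duplicates
# don't skew the model toward over-represented documents.
import hashlib

def dedupe(documents):
    """Keep the first occurrence of each document, ignoring case and whitespace."""
    seen = set()
    unique = []
    for doc in documents:
        normalized = " ".join(doc.lower().split())  # cheap normalization
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["Invoice overdue notice", "invoice  overdue notice", "Renewal reminder"]
print(dedupe(docs))  # -> ['Invoice overdue notice', 'Renewal reminder']
```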

Harnessing Highly Relevant Data that You Already Own

Recent data scraping restrictions don’t have to derail big data and AI initiatives. Instead, organizations should look internally at their own data to train generative AI models that produce more relevant, thoughtful and worthwhile results. This will require getting a better handle on the data they already collect by modernizing existing data storage policies to put information in the right context and make it more consumable for developers and AI models. Data may be the new oil, but businesses don’t have to go beyond their own borders to cash in. The answer is right there in the organization already – that data is just waiting to be thoughtfully managed and fed into new generative AI models to create powerful experiences that inform and delight.
