Revolutionary innovation: Dolly 2.0 from Databricks brings cost-effective language models to enterprises and startups
Databricks presents Dolly 2.0 - An open-source language model for enterprises and startups
Data privacy and compliance are critical issues for businesses today, especially for startups and small to mid-sized companies. As a Data-Privacy-as-a-Service startup, heyData offers an all-in-one platform that helps companies manage their data privacy and compliance requirements efficiently, and we are constantly looking for innovative solutions that support startups, enterprises and founders in this.
In this context, Dolly 2.0, Databricks' latest language model, is significant. Databricks describes it as the first open-source, instruction-following large language model (LLM) fine-tuned on a human-generated instruction dataset that is licensed for both research and commercial use.
Background
In the world of AI models, Databricks has made a remarkable advance with Dolly 2.0, a ChatGPT-like language model. Its predecessor, Dolly 1.0, was trained for less than $30. Dolly 2.0 is a 12-billion-parameter model based on the EleutherAI Pythia model family. It was fine-tuned exclusively on a human-generated instruction dataset that was crowdsourced among Databricks employees and is licensed for research and commercial use. The model is now open source, giving enterprises and startups a cost-effective way to build and customize powerful language models for their conversational applications.
Open source: Free use for companies and startups
Dolly 2.0 is open source, which means organizations can create, customize, and own their own language models without paying for API access or sharing data with third parties. This is an important development for enterprises and startups looking for cost-effective solutions for conversational applications. With Dolly 2.0, they are free to customize and extend the model to meet their specific needs.
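To make the point about running the model without a third-party API more concrete, here is a minimal sketch of local inference. It rests on several assumptions that are not stated in this article: that the model is loaded through the Hugging Face `transformers` library (with `torch` and `accelerate` installed), that the checkpoint name is `databricks/dolly-v2-3b`, and that the machine has enough memory for it. The helper name `load_dolly` is hypothetical.

```python
# Assumption: a Dolly 2.0 checkpoint published on the Hugging Face Hub.
# The smallest variant is used here to keep hardware requirements low.
MODEL_ID = "databricks/dolly-v2-3b"


def load_dolly(model_id: str = MODEL_ID):
    """Download the weights once and return a text-generation pipeline
    that runs on your own hardware (hypothetical helper)."""
    # Imported lazily so that this module can be imported without the
    # heavy dependencies being installed.
    import torch
    from transformers import pipeline

    return pipeline(
        model=model_id,
        torch_dtype=torch.bfloat16,  # half-precision weights to save memory
        trust_remote_code=True,      # the checkpoint ships its own pipeline code
        device_map="auto",           # requires the `accelerate` package
    )
```

Calling `generate = load_dolly()` and then `generate("Explain what a data processing agreement is.")` should return a list whose first element contains the answer under the key `generated_text`. After the initial download of the weights, prompts and answers stay on the machine that runs the model.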
Unique instruction dataset for fine-tuning
A unique feature of Dolly 2.0 is the instruction dataset created by Databricks employees. The Databricks Dolly 15k dataset, with about 15,000 prompt/response pairs, was developed specifically for instruction tuning of large language models and is available under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This means that anyone can use, modify, or extend it, including for commercial applications. According to Databricks, it is the first open-source, human-generated instruction dataset designed to give large language models the interactivity of ChatGPT. It contains natural, expressive training records representing various behaviors such as brainstorming, content creation, information retrieval, and summarization.
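To illustrate what such a prompt/response record can look like and how it might be turned into training text, here is a small sketch. The field names `instruction`, `context`, `response` and `category` follow the published dataset. The two sample records are hypothetical and were written for this example, and the Alpaca-style prompt wording is an assumption rather than the confirmed template used to train Dolly 2.0.

```python
from collections import Counter

# Assumption: Alpaca-style wording; the template actually used may differ.
INTRO = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)


def format_record(record: dict) -> str:
    """Turn one instruction record into a single training text."""
    parts = [INTRO, "### Instruction:\n" + record["instruction"]]
    context = record.get("context", "").strip()
    if context:  # only tasks that refer to a reference passage carry a context
        parts.append("Input:\n" + context)
    parts.append("### Response:\n" + record["response"])
    return "\n\n".join(parts)


# Hypothetical records written for this example (not taken from the dataset).
records = [
    {
        "instruction": "Suggest three names for a privacy newsletter.",
        "context": "",
        "response": "Data Digest, Privacy Pulse, Consent Corner",
        "category": "brainstorming",
    },
    {
        "instruction": "Summarize the text in one sentence.",
        "context": "Dolly 2.0 is an open-source language model released by Databricks.",
        "response": "Databricks released the open-source language model Dolly 2.0.",
        "category": "summarization",
    },
]

# Count how many records fall into each behavior category.
category_counts = Counter(r["category"] for r in records)
training_texts = [format_record(r) for r in records]

print(category_counts)
print(training_texts[1])
```

Because every record carries a category, a company extending the dataset with its own examples can check whether the behaviors it cares about are represented before fine-tuning.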
Motivation
The creation of the new Dolly 2.0 dataset was motivated by requests from users who wanted to know whether they could use Dolly commercially. Dolly 1.0 was trained on a dataset that the Stanford Alpaca team had created using the OpenAI API, and the API's terms of use restricted commercial use of models built on that data.
Databricks therefore decided to create a new dataset that was not "contaminated" and could be used for commercial purposes. To do this, they took inspiration from OpenAI's InstructGPT research and engaged Databricks staff in a competition to generate an original, high-quality dataset covering tasks such as open and closed Q&A, information extraction and summarization from Wikipedia, classification, and creative writing. This underscores Databricks' commitment to the needs of its users and gives companies access to powerful language models for their commercial applications.
Inspiration from InstructGPT
The development of Dolly 2.0 was inspired by OpenAI's research paper on InstructGPT, a language model trained specifically to follow instructions and perform complex tasks. Dolly 2.0 is based on EleutherAI's Pythia family of models and has been fine-tuned on the Databricks Dolly 15k dataset to develop similar instruction-following capabilities. This enables Dolly 2.0 to handle a variety of tasks such as brainstorming, content creation, information retrieval, and summarization, and to support users in different application domains.
Advantages for startups and enterprises
Releasing Dolly 2.0 as open source offers numerous benefits for startups and enterprises. Here are some of the most important:
- Cost efficiency: Because Dolly 2.0 is open source, startups and enterprises can use it without licensing fees or subscriptions and pay only for the compute they run it on. This frees up resources for other important parts of the business.
- Flexibility: Being open source, Dolly 2.0 can be customized and adapted. Startups and enterprises can tailor the model's behavior to their specific needs and develop their own solutions on top of it.
- Community engagement: The open source community is known for its collaboration and sharing of knowledge and resources. Startups and enterprises can work with the developer community to fix bugs, implement new features and further improve the software.
- Faster innovation: Open source allows startups and enterprises to build on an existing code base and develop innovative solutions faster. They benefit from the work of other developers instead of starting from scratch.
- Interoperability: As open source, Dolly 2.0 can be integrated with various technologies and systems, allowing startups and enterprises to connect it to other products and services and extend their functionality.
- Transparency and trust: Since Dolly 2.0's source code is publicly available, startups and enterprises can review it and check that it is secure and trustworthy. This can help build customer and user trust in the software.
- Resource sharing: By using open source, startups and companies can share resources and exchange ideas with other developers and organizations. This can lead to more efficient use of resources and to joint work on new solutions.
- Adaptability: The open source nature of Dolly 2.0 allows startups and enterprises to adapt the software to new technologies, market requirements or business models. This lets them respond quickly and keep improving their solutions to remain competitive.
In summary, Databricks' Dolly 2.0 gives enterprises and startups an open, commercially usable instruction-following language model. It enables organizations to build high-quality language models adapted to their needs and to integrate them into their existing workflows and data processing pipelines. With Dolly 2.0, companies can use language AI to improve their use cases, optimize communication with their target audience, and streamline their business processes without sharing their data with third parties.