Solutions that utilize Natural Language Processing (NLP), such as generative AI tools and speech recognition (SR) systems, need human-generated text or language data for accurate operation. Businesses and developers depend on data collection services to obtain this data.
If you are considering working with language or text data collection services, this article provides a comparison of the top data collection and generation services available in the market. It also includes criteria to assist companies in narrowing down their options and a detailed evaluation section for all the companies compared in this article.
Text data collection services comparison
Selecting the right partner for collecting text data is a significant decision for any NLP project. The tables below list the top companies in the market offering text data collection and generation services:
Table 1. Comparison based on the market presence & experience criteria
Platforms | User Ratings Out of 5 (Avg)* | Number of Reviews* | Founded | Data Collection Focus** |
---|---|---|---|---|
Clickworker | 4.1 | 68 | 2005 | ✅ |
Appen | 4.2 | 54 | 1996 | ✅ |
Prolific | 4.7 | 48 | 2014 | ✅ |
Amazon Mechanical Turk | 4 | 28 | 2005 | ✅ |
Telus International | 4.3 | 10 | 2005 | ✖ |
TaskUs | 4.3 | 6 | 2008 | ✖ |
Summa Linguae Technologies | N/A | N/A | 2011 | ✅ |
LXT | N/A | N/A | 2010 | ✅ |
Surge AI | N/A | N/A | 2020 | ✖ |
Toloka AI | N/A | N/A | 2014 | ✅ |
Innodata Inc | N/A | N/A | 1988 | ✅ |
DataForce by Transperfect | N/A | N/A | 1992 | ✅ |
* The data was gathered from B2B review platforms such as G2, Trustradius, and Capterra.
** If the company mentions data collection as the first offering on its website, we consider it to be data collection-focused.
Table 2. Comparison based on platform capabilities
Platforms | Text Annotation | Text Data Types/Formats | Languages*** | Mobile application | API Integration | ISO 27001 Certification | Code of Conduct |
---|---|---|---|---|---|---|---|
Clickworker | ✅ | Handwritten, Typed, Sentiment analysis | 30+ | ✅ | ✅ | ✅ | ✅ |
Appen | ✅ | Typed, Sentiment analysis | 235+ | ✅ | ✅ | ✅ | ✅ |
Prolific | ✖ | N/A | N/A | ✖ | ✅ | ✖ | ✅ |
Amazon Mechanical Turk | N/A | N/A | N/A | ✖ | ✅ | N/A | ✖ |
Telus International | ✅ | Handwritten, Typed | 500+ | ✖ | ✅ | ✖ | ✖ |
TaskUs | ✅ | Typed, Sentiment analysis | 65+ | ✖ | ✅ | ✅ | ✅ |
Summa Linguae Technologies | ✅ | Typed | 35+ | ✅ | ✅ | ✅ | ✖ |
LXT | ✅ | Typed | 1000+ | ✖ | ✖ | ✅ | ✖ |
Surge AI | ✅ | Typed | ✖ | ✅ | ✅ | ✖ | |
Toloka AI | ✅ | Typed, Sentiment analysis | 40+ | ✅ | ✅ | ✅ | ✅ |
Innodata Inc | ✅ | Typed, Sentiment analysis | 40+ | ✖ | ✅ | ✅ | ✖ |
DataForce by Transperfect | ✅ | N/A | 250+ | ✅ | ✖ | ✅ | ✖ |
*** Based on vendor claims from websites.
Notes for the tables:
- The comparison tables are created from publicly available and verifiable data.
- Both tables are sorted by the number of reviews.
- The vendors were selected based on the relevance of their services. This means that all vendors that offered text or language data collection or generation were included.
- Apart from text data, all companies cover a wide array of data types for their data collection & annotation services (image, video, audio/speech, etc.).
- Vendors were also filtered by size: only companies with 50+ employees were included.
- In Table 2, a company is assumed to follow a code of conduct if it has a code of conduct page on its website.
- These tables will not be updated regularly. For current information, you can check out our data-driven list of data collection services to find the right option for your text data needs.
Criteria for selecting a text data collection service
This section covers the criteria you can use to narrow down your options of text data providers.
Market presence and experience
- User ratings*: High average ratings on B2B platforms often indicate robust customer satisfaction.
- Number of reviews*: A greater number of reviews typically reflects a wider user base and provides detailed insights into customer experiences.
- Founded: The year a company was founded can be significant, as older firms often have more polished services thanks to their experience. However, this is not a universal rule, as some companies specialize in a particular service and acquire greater expertise in a shorter time frame. Use this criterion alongside customer reviews rather than on its own.
- Data collection focus: Companies specializing primarily in data collection and generation are likely more skilled in these areas.
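The market presence criteria above can be applied as a simple filter over the Table 1 data. The following is a minimal sketch, not a recommended scoring method: the `Vendor` class, the `shortlist` function, and the thresholds (minimum rating of 4.0, at least 20 reviews) are hypothetical choices made for illustration, and only the first seven rows of Table 1 are included.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Vendor:
    name: str
    rating: Optional[float]   # average user rating out of 5; None where Table 1 shows N/A
    reviews: Optional[int]    # number of reviews; None where Table 1 shows N/A
    founded: int
    collection_focus: bool    # True where Table 1 marks a data collection focus


# Values copied from Table 1 (first seven rows only).
VENDORS = [
    Vendor("Clickworker", 4.1, 68, 2005, True),
    Vendor("Appen", 4.2, 54, 1996, True),
    Vendor("Prolific", 4.7, 48, 2014, True),
    Vendor("Amazon Mechanical Turk", 4.0, 28, 2005, True),
    Vendor("Telus International", 4.3, 10, 2005, False),
    Vendor("TaskUs", 4.3, 6, 2008, False),
    Vendor("Summa Linguae Technologies", None, None, 2011, True),
]


def shortlist(vendors: List[Vendor],
              min_rating: float = 4.0,
              min_reviews: int = 20,
              require_focus: bool = True) -> List[Vendor]:
    """Keep vendors that pass every threshold, sorted by review count (descending).

    The thresholds are illustrative assumptions, not values from the article.
    Vendors without rating or review data are excluded because they cannot be compared.
    """
    picked = [
        v for v in vendors
        if v.rating is not None
        and v.reviews is not None
        and v.rating >= min_rating
        and v.reviews >= min_reviews
        and (v.collection_focus or not require_focus)
    ]
    return sorted(picked, key=lambda v: v.reviews, reverse=True)


names = [v.name for v in shortlist(VENDORS)]
print(names)
```

Because vendors without review data are dropped, a filter like this favors companies with a presence on B2B review platforms. Loosening `min_reviews` or `require_focus` widens the shortlist.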
Platform capabilities
- Text annotation: It can be efficient if the data provider also offers text annotation as a service since data collection and annotation are complementary to each other.
- Text data types/formats: Consider the text data formats the company offers.
- Languages***: Verify which languages the service supports and whether it includes the specific language(s) you need.
- Mobile application: Enables efficient management of projects on-the-go and unique scenarios for voice data collection.
- API integration: Facilitates seamless data transfer and processing.
- ISO 27001 certification: Demonstrates compliance with an international standard for information security management.
- Code of Conduct: Showcases a commitment to ethical treatment of the workforce.
- Crowd size: A vast and diverse global workforce offers scalability and variety in solutions. A larger pool of workers can provide text datasets in a broader range of languages and dialects.
Figure 1. Crowd comparison of the text data collection services

Notes for Figure 1:
- Companies with a crowd size of less than 100K were not included.
- Some vendors were also excluded since their crowd size data was not found on their websites.
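The platform capability criteria can be checked the same way as a set of hard requirements. The sketch below assumes a hypothetical project that needs API integration, ISO 27001 certification, a code of conduct, and at least 50 supported languages. The `CAPABILITIES` dictionary holds a subset of Table 2 rows, and the field names and the `filter_vendors` function are illustrative assumptions. Language counts are the vendors' own "N+" claims, so they are treated as lower bounds.

```python
from typing import Dict, List

# A subset of Table 2. "min_languages" is the vendor-claimed lower bound (e.g. "30+" -> 30).
CAPABILITIES: Dict[str, Dict[str, object]] = {
    "Clickworker": {"min_languages": 30, "mobile_app": True, "api": True,
                    "iso_27001": True, "code_of_conduct": True},
    "Appen": {"min_languages": 235, "mobile_app": True, "api": True,
              "iso_27001": True, "code_of_conduct": True},
    "Telus International": {"min_languages": 500, "mobile_app": False, "api": True,
                            "iso_27001": False, "code_of_conduct": False},
    "TaskUs": {"min_languages": 65, "mobile_app": False, "api": True,
               "iso_27001": True, "code_of_conduct": True},
    "LXT": {"min_languages": 1000, "mobile_app": False, "api": False,
            "iso_27001": True, "code_of_conduct": False},
    "Toloka AI": {"min_languages": 40, "mobile_app": True, "api": True,
                  "iso_27001": True, "code_of_conduct": True},
}


def filter_vendors(capabilities: Dict[str, Dict[str, object]],
                   required_flags: List[str],
                   min_languages: int = 0) -> List[str]:
    """Return vendor names that have every required flag and enough claimed languages."""
    return [
        name
        for name, caps in capabilities.items()
        if caps["min_languages"] >= min_languages
        and all(caps[flag] for flag in required_flags)
    ]


# Hypothetical requirements for an example project.
result = filter_vendors(
    CAPABILITIES,
    required_flags=["api", "iso_27001", "code_of_conduct"],
    min_languages=50,
)
print(result)
```

A language count only shows breadth. It is still necessary to confirm with the vendor that the specific languages and dialects a project needs are covered.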
Company evaluation
Here is a brief summary of each company’s offerings and its performance evaluation based on customer reviews and recent news.
1. Clickworker
Clickworker offers AI data collection and generation services through its crowdsourcing platform, covering multiple data types, including text, audio, image, and video. Its offerings include:
- Human-generated text datasets in multiple languages
- Handwritten datasets
- Sentiment analysis data and service
- Text annotation services
- Image, video, audio, and speech data collection, generation, and annotation.
Clickworker’s pros and cons:
- Customers state that Clickworker’s crowd is reliable and the platform is easy to use.1

- One customer review discusses Clickworker’s data annotation service and its prices.2

2. Appen
Appen works with a crowdsourcing platform focusing on deep learning, data collection, and machine-learning models. It offers:
- Text data collection and generation services
- Text annotation services
- Sentiment analysis services
Appen’s pros and cons:
- Recent news reports indicate that Appen’s performance is declining as it loses customers and incurs financial losses.3
- While some customers stated that Appen’s platform is easy to use, they also reported server crashes.4

3. Prolific
Prolific also offers AI data collection services through a crowdsourcing platform. Here is a list of its offerings:
- Text data collection
- Research data
- Does not offer data annotation as a service
- Data labeling tools can be paired with Prolific’s tool
Prolific’s pros and cons:
- Most of Prolific’s reviews concern its research-related services, which suggests that its AI data services may be less widely used.5
- Even though some research customers found Prolific’s customer support to be good, they had issues with the platform’s inability to set customized quotas based on geographic and demographic parameters.6
- Prolific also offers a relatively smaller crowd than other data services.

4. Amazon Mechanical Turk
Amazon Mechanical Turk, or MTurk, offers crowd-sourced data collection and diverse data solutions ranging from text to video. Its AI data offerings include:
- Text data collection
- Other data collection services (image, video, audio)
MTurk’s pros and cons:
- While customers found MTurk’s service quick, they also found the data quality to be low.7

5. Telus International
Telus International offers AI data solutions that span machine learning, computer vision, and natural language processing. Its offerings are:
- Custom text data collection
- Text annotation
- Data collection for other data types (image, video, audio, etc.)
- Other data services for AI development.
Telus International’s pros and cons:
- The company has a data annotation service and offers a relatively large network of data collectors/annotators.
- There were no reviews found regarding the company’s data collection services, which can make it difficult for potential buyers to evaluate its performance.
6. TaskUs
TaskUs also operates with a crowdsourcing model to offer text data solutions. However, its key offering is in the customer experience domain. Its offerings include:
- Text data collection/generation
- Sentiment analysis services
- Sentiment data is not offered
7. Summa Linguae Technologies
With a focus on custom solutions, Summa Linguae offers tools and services catering to different AI project requirements. Here are Summa Linguae’s offerings:
- Custom data collection, including all data types (text, image, video, etc.)
- Text annotation
- Machine learning model training data
- Data security and quality assurance
8. LXT
LXT is also an emerging player in the data collection space, offering various services for AI development. Its offerings include:
- Text data collection for NLP
- Text data annotation
- Data collection for other data types (Image, video, audio)
9. Surge AI
Based in California, Surge AI provides training data for machine learning models through a crowdsourcing platform. Surge AI focuses on collecting and labeling data for large language models (LLMs). Here are some of its data services:
- Text data collection
- Text data labeling and annotation
- Reinforcement Learning from Human Feedback (RLHF)
- Other human-generated data services
10. Toloka AI
Operating with a crowdsourcing platform, Toloka AI specializes in collecting data for AI models, especially natural language processing (NLP). Its offerings include:
- Text data solutions
- Text annotation
- Data collection of other data types
Toloka AI’s pros and cons:
- The company claims to offer text data collection and annotation in multiple languages.
- Toloka AI operates with a significantly smaller crowd than companies like Clickworker and Appen.
- B2B customer reviews were not found, which can make it difficult for potential customers to evaluate its services from the customer’s perspective.
11. Innodata Inc
Specializing in creating AI training data, Innodata Inc. offers custom data solutions to train machine learning models. Its AI data services include:
- Text data collection service
- Machine learning project consultancy
- Data security solutions
12. DataForce by Transperfect
DataForce caters to specific AI development needs, offering a blend of text, image, video, and audio/speech data.
Offerings:
- Audio and voice datasets
- Image and video data collection services
- Experienced project managers for AI needs
Final recommendations
As solutions powered by AI, machine learning, and NLP become increasingly important in business processes, the need to work with text data services is anticipated to rise.
These services are crucial for gathering the data required for AI to effectively understand and process various languages. By selecting a data partner that follows the above-mentioned standards, organizations can secure high-quality, ethically sourced, and accurately annotated data, establishing a robust groundwork for their AI projects.
You can also consider the following key points while selecting a vendor:
- Level of diversity: It is important to work with a partner that offers a large and diverse workforce. This will ensure it can provide a scalable service in a timely manner.
- Customer satisfaction: You can analyze reviews and assess whether the company can meet deadlines.
- Clear description and understanding: Clarify edge cases and potential issues in advance, so the workforce can work efficiently without needing to pause and ask for clarification.
Transparency statement
AIMultiple serves numerous emerging tech companies and vendors, including the ones linked in this article.
External resources
1. Clickworker customer review on reliability and easy-to-use platform. G2. Accessed: 05/December/2023.
2. Clickworker’s review regarding data annotation services. G2. Accessed: 05/December/2023.
3. Hayden Field (2023). Inside the turmoil at Appen, the former AI darling that’s reeling from executive exits, big losses. CNBC. Accessed: 05/December/2023.
4. Appen’s negative review regarding server crashes. G2. Accessed: 04/December/2023.
5. Most Prolific reviews are for its research services. G2. Accessed: 05/December/2023.
6. Prolific’s review on customer support and customized parameters. G2. Accessed: 05/December/2023.
7. Negative review regarding MTurk’s data collection service. G2. Accessed: 05/December/2023.
This post originally appeared on TechToday.