Top 12 Text Data Collection Services in 2023

Solutions that utilize Natural Language Processing (NLP), such as generative AI tools and speech recognition (SR) systems, need human-generated text or language data for accurate operation. Businesses and developers depend on data collection services to obtain this data.

If you are considering working with language or text data collection services, this article provides a comparison of the top data collection and generation services available in the market. It also includes criteria to assist companies in narrowing down their options and a detailed evaluation section for all the companies compared in this article.

Text data collection services comparison

Selecting the right partner for collecting text data is a significant decision for any NLP project. The tables below compare the top companies offering text data collection and generation services:

Table 1. Comparison based on the market presence & experience criteria

Platforms | User ratings out of 5 (avg)* | Number of reviews* | Founded | Data collection focus**
Clickworker | 4.1 | 68 | 2005 |
Appen | 4.2 | 54 | 1996 |
Prolific | 4.7 | 48 | 2014 |
Amazon Mechanical Turk | 4 | 28 | 2005 |
Telus International | 4.3 | 10 | 2005 |
TaskUs | 4.3 | 6 | 2008 |
Summa Linguae Technologies | N/A | N/A | 2011 |
LXT | N/A | N/A | 2010 |
Surge AI | N/A | N/A | 2020 |
Toloka AI | N/A | N/A | 2014 |
Innodata Inc | N/A | N/A | 1988 |
DataForce by TransPerfect | N/A | N/A | 1992 |

* The data was gathered from B2B review platforms such as G2, TrustRadius, and Capterra.

** If the company mentions data collection as the first offering on its website, we consider it to be data collection-focused.

Table 2. Comparison based on platform capabilities

Columns: Platforms, Text annotation, Text data types, Languages***, Mobile application, API integration, ISO 27001 certification, Code of conduct

Clickworker: Handwritten, typed, sentiment analysis
Appen: Typed, sentiment analysis
Prolific: N/A, N/A
Amazon Mechanical Turk: N/A, N/A, N/A, N/A
Telus International: Handwritten, typed
TaskUs: Typed, sentiment analysis
Summa Linguae Technologies: Typed; 35+ languages
LXT: Typed; 1000+ languages
Surge AI: Typed
Toloka AI: Typed, sentiment analysis
Innodata Inc: Typed, sentiment analysis
DataForce by TransPerfect: N/A; 250+ languages

*** Based on vendor claims from websites.

Notes for the tables:
  • The comparison tables are created from publicly available and verifiable data.
  • Both tables are ranked by number of reviews.
  • The vendors were selected based on the relevance of their services. This means that all vendors that offered text or language data collection or generation were included.
  • Apart from text data, all companies cover a wide array of data types for their data collection & annotation services (image, video, audio/speech, etc.).
  • Vendors were also filtered by size: only companies with 50+ employees were included.
  • In Table 2, a company is assumed to follow a code of conduct if it has a code of conduct page on its website.
  • These tables will not be updated regularly. For current information, check out our data-driven list of data collection services to find the right option for your text data needs.
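
The ranking rule in the notes above (order by number of reviews) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method used to build the tables: the `vendors` list copies ratings, review counts, and founding years from Table 1, `None` stands in for N/A, and the function name `rank_by_reviews` is hypothetical.

```python
# Table 1 rows as (vendor, avg_rating, review_count, founded).
# None stands in for "N/A"; the input order here is deliberately shuffled.
vendors = [
    ("Toloka AI", None, None, 2014),
    ("TaskUs", 4.3, 6, 2008),
    ("Appen", 4.2, 54, 1996),
    ("LXT", None, None, 2010),
    ("Telus International", 4.3, 10, 2005),
    ("Clickworker", 4.1, 68, 2005),
    ("Amazon Mechanical Turk", 4.0, 28, 2005),
    ("Prolific", 4.7, 48, 2014),
]


def rank_by_reviews(rows):
    """Sort by review count, highest first; vendors without reviews go last."""
    # The key puts rows with a review count (False) before rows without (True),
    # then orders by descending count. Python's sort is stable, so vendors
    # without reviews keep their input order.
    return sorted(rows, key=lambda row: (row[2] is None, -(row[2] or 0)))


for name, rating, reviews, founded in rank_by_reviews(vendors):
    print(name, rating, reviews, founded)
```

Placing vendors without reviews at the end, rather than dropping them, mirrors how the tables list the N/A rows after the reviewed ones.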

Criteria for selecting a text data collection service

This section covers the criteria you can use to narrow down your options among text data providers.

Market presence and experience

  • User ratings*: High average ratings on B2B platforms often indicate robust customer satisfaction.
  • Number of reviews*: A greater number of reviews typically reflects a wider user base and provides detailed insights into customer experiences.
  • Founded: The year a company was founded can be significant, as older firms often have more polished services thanks to their experience. However, this is not a universal rule, as some companies specialize in a particular service and acquire greater expertise in a shorter time frame. Use this criterion alongside customer reviews rather than on its own.
  • Data collection focus: Companies specializing primarily in data collection and generation are likely more skilled in these areas.
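
One way to apply the rating and review criteria together is a simple shortlist filter. The sketch below is only an illustration: it uses the six rated vendors from Table 1, and the function name `shortlist` and the threshold values (4.2 rating, 10 reviews) are assumptions chosen for the example, not recommendations from this article.

```python
# Rated vendors from Table 1 as (vendor, avg_rating, review_count, founded).
rated_vendors = [
    ("Clickworker", 4.1, 68, 2005),
    ("Appen", 4.2, 54, 1996),
    ("Prolific", 4.7, 48, 2014),
    ("Amazon Mechanical Turk", 4.0, 28, 2005),
    ("Telus International", 4.3, 10, 2005),
    ("TaskUs", 4.3, 6, 2008),
]


def shortlist(rows, min_rating, min_reviews):
    """Return names of vendors that meet both the rating and review thresholds."""
    return [
        name
        for name, rating, reviews, _founded in rows
        if rating >= min_rating and reviews >= min_reviews
    ]


# Hypothetical thresholds; choose values that fit your own project.
print(shortlist(rated_vendors, min_rating=4.2, min_reviews=10))
```

Requiring a minimum number of reviews alongside the rating guards against a high average that rests on only a handful of reviews.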

Platform capabilities

  • Text annotation: Working with a provider that also offers text annotation can be efficient, since data collection and annotation are complementary.
  • Text data types/formats: Consider the text data formats the company offers.
  • Languages***: Verify which languages the service supports and whether it includes the specific language(s) you need.
  • Mobile application: Enables efficient project management on the go and supports unique scenarios for voice data collection.
  • API integration: Facilitates seamless data transfer and processing.
  • ISO certification: Demonstrates compliance with international standards for data security and quality.
  • Code of Conduct: Showcases a commitment to ethical treatment of the workforce.
  • Crowd size: A vast and diverse global workforce offers scalability and variety in solutions. A larger pool of workers can provide text datasets in a broader range of languages and dialects.

Figure 1. Crowd comparison of the text data collection services

A bar graph showing the crowd sizes of the text data collection services mentioned in this article. Clickworker has the largest crowd with 4.5 million, followed by Appen and Telus International with 1 million each, and Prolific last with fewer than 300,000.

Notes for Figure 1:

  • Companies with a crowd size of less than 100K were not included.
  • Some vendors were also excluded since their crowd size data was not found on their websites.

Company evaluation

Here is a brief summary of each company’s offerings and its performance evaluation based on customer reviews and recent news.

1. Clickworker

Clickworker offers AI data collection and generation services through its crowdsourcing platform, covering multiple data types, including text, audio, image, and video. Its offerings include:

  • Human-generated text datasets in multiple languages
  • Handwritten datasets
  • Sentiment analysis data and service
  • Text annotation services
  • Image, video, audio, and speech data collection, generation, and annotation.

Clickworker’s pros and cons:

  • Customers state that Clickworker’s crowd is reliable and the platform is easy to use [1].
  • Another customer review covers Clickworker’s data annotation service and its prices [2].

2. Appen

Appen works with a crowdsourcing platform focusing on deep learning, data collection, and machine-learning models. It offers:

  • Text data collection and generation services
  • Text annotation services
  • Sentiment analysis services

Appen’s pros and cons:

  • Recent news reports indicate that Appen’s performance is declining as it loses customers and posts financial losses [3].
  • While some customers stated that Appen’s platform is easy to use, they also reported server crashes [4].

3. Prolific

Prolific also offers AI data collection services through a crowdsourcing platform. Here is a list of its offerings:

  • Text data collection
  • Research data
  • Does not offer data annotation as a service
  • Data labeling tools can be paired with Prolific’s tool

Prolific’s pros and cons:

  • Most of Prolific’s reviews concern its research-related services, which suggests that its AI data services may be less widely used [5].
  • Even though some research customers found Prolific’s customer support to be good, they had issues with the platform’s inability to set customized quotas based on geographic and demographic parameters [6].
  • Prolific also offers a relatively smaller crowd than other data services.

4. Amazon Mechanical Turk

Amazon Mechanical Turk, or MTurk, offers crowd-sourced data collection and diverse data solutions ranging from text to video. Its AI data offerings include:

  • Text data collection
  • Other data collection services (image, video, audio)

MTurk’s pros and cons:

  • While customers found MTurk’s service quick, they also found the data quality to be low [7].

5. Telus International

Telus International offers AI data solutions that span across machine learning, computer vision, and natural language processing. Its offerings are:

  • Custom text data collection
  • Text annotation
  • Data collection for other data types (image, video, audio, etc.)
  • Other data services for AI development

Telus International’s pros and cons:

  • The company has a data annotation service and offers a relatively large network of data collectors/annotators.
  • There were no reviews found regarding the company’s data collection services, which can make it difficult for potential buyers to evaluate its performance.

6. TaskUs

TaskUs also operates with a crowdsourcing model to offer text data solutions. However, its key offering is in the customer experience domain. Its offerings include:

  • Text data collection/generation
  • Sentiment analysis is offered
  • Sentiment data is not offered.

7. Summa Linguae Technologies

With a focus on custom solutions, Summa Linguae offers tools and services catering to different AI project requirements. Here are Summa Linguae’s offerings:

  • Custom data collection, including all data types (text, image, video, etc.)
  • Text annotation
  • Machine learning model training data
  • Data security and quality assurance

8. LXT

LXT is an emerging player in the data collection space, offering various services for AI development. Its offerings include:

  • Text data collection for NLP
  • Text data annotation
  • Data collection for other data types (image, video, audio)

9. Surge AI

Based in California, Surge AI provides training data for machine learning models through a crowdsourcing platform. Surge AI focuses on collecting and labeling data for large language models (LLMs). Here are some of its data services:

  • Text data collection
  • Text data labeling and annotation
  • Reinforcement Learning from Human Feedback (RLHF)
  • Other human-generated data services

10. Toloka AI

Operating with a crowdsourcing platform, Toloka AI specializes in collecting data for AI models, especially natural language processing (NLP). Its offerings include:

  • Text data solutions
  • Text annotation
  • Data collection of other data types

Toloka AI’s pros and cons:

  • The company claims to offer text data collection and annotation in multiple languages.
  • Toloka AI operates with a significantly smaller crowd than companies like Clickworker and Appen.
  • B2B customer reviews were not found, which can make it difficult for potential customers to evaluate its services from the customer’s perspective.

11. Innodata Inc

Specializing in creating AI training data, Innodata Inc. offers custom data solutions to train machine learning models. Its AI data services include:

  • Text data collection service
  • Machine learning project consultancy
  • Data security solutions

12. DataForce by TransPerfect

DataForce caters to specific AI development needs, offering a blend of text, image, video, and audio/speech data. Its offerings include:

  • Audio and voice datasets
  • Image and video data collection services
  • Experienced project managers for AI needs

Final recommendations

As solutions powered by AI, machine learning, and NLP become increasingly important in business processes, the need to work with text data services is anticipated to rise.

These services are crucial for gathering the data required for AI to effectively understand and process various languages. By selecting a data partner that follows the above-mentioned standards, organizations can secure high-quality, ethically sourced, and accurately annotated data, establishing a robust groundwork for their AI projects.

You can also consider the following key points while selecting a vendor:

  • Level of diversity: It is important to work with a partner that offers a large and diverse workforce. This will ensure it can provide a scalable service in a timely manner.
  • Customer satisfaction: You can analyze reviews and assess whether the company can meet deadlines. 
  • Clear description and understanding: Clarify edge cases and potential issues in advance, so the workforce can work efficiently without needing to pause and ask for clarification.

Transparency statement

AIMultiple serves numerous emerging tech companies and vendors, including the ones linked in this article.

External resources

  1. Clickworker customer review on reliability and easy-to-use platform. G2. Accessed: 05/December/2023.
  2. Clickworker’s review regarding data annotation services. G2. Accessed: 05/December/2023.
  3. Hayden Field (2023). Inside the turmoil at Appen, the former AI darling that’s reeling from executive exits, big losses. CNBC. Accessed: 05/December/2023.
  4. Appen’s negative review regarding server crashes. G2. Accessed: 04/December/2023.
  5. Most Prolific reviews are for its research services. G2. Accessed: 05/December/2023.
  6. Prolific’s review on customer support and customized parameters. G2. Accessed: 05/December/2023.
  7. Negative review regarding MTurk’s data collection service. G2. Accessed: 05/December/2023.

This post originally appeared on TechToday.
