Crossover: 2014

Chapter 259 Crazy Data

Take a few examples:
In image recognition, we often need millions of manually labeled images.

In speech recognition, we may need tens of thousands of hours of human-labeled audio.

In machine translation, tens of millions of annotated sentence pairs are required.

To be honest, as a technician who had come back from several years in the future, Lin Hui had never really taken the value of manually labeled data seriously.

But now it seemed that he had clearly overlooked the value of this thing.

Lin Hui remembered a set of figures on human translation that he had seen in 2017 in his previous life.

Translation cost about 5 to 10 cents per word, and the average sentence was about 30 words long.

To label 10 million bilingual sentence pairs, that is, to have experts translate 10 million sentences, the labeling would cost roughly 22 million US dollars.
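The arithmetic behind that figure can be sketched as follows. This is only a rough check, assuming the sentence count is ten million and using the per-word rate and sentence length quoted above; the function and constant names are illustrative.

```python
# Figures quoted in the text: 5 to 10 cents per word, about 30 words
# per sentence, and (assumed here) ten million sentence pairs.
SENTENCES = 10_000_000
WORDS_PER_SENTENCE = 30


def labeling_cost(cents_per_word):
    """Total translation cost in US dollars at a given per-word rate in cents."""
    return SENTENCES * WORDS_PER_SENTENCE * cents_per_word / 100


low = labeling_cost(5)    # cheapest rate quoted
high = labeling_cost(10)  # most expensive rate quoted
print(f"{low:,.0f} to {high:,.0f} US dollars")
```

A figure of about 22 million US dollars sits inside that range, which corresponds to a rate of a little over 7 cents per word.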

Clearly, data labeling is very, very expensive.

And that was just the cost of data labeling in 2017.

Doesn't that mean labeling costs even more now?
After all, unsupervised learning gets very little attention at present.

There are almost no usable unsupervised models.

Mainstream machine learning still relies on supervised and semi-supervised learning.

And supervised and semi-supervised learning are basically inseparable from manually labeled data.

Seen from this perspective, wasn't the large amount of ready-made manually labeled data in Lin Hui's possession a huge hidden fortune?

If labeling 10 million bilingual sentence pairs cost more than 20 million US dollars in 2017 of the previous life,

then in this 2014, in a time and space where machine learning as a whole lags behind,

how much would it cost to label the same 10 million pairs?
Lin Hui felt that 10 million pairs of bilingual labeled data would cost two to three hundred million US dollars.

The figure of "two to three hundred million US dollars" may seem a bit scary.

But it is not an exaggeration.

There are two reasons for this:

1. Even in the previous life, the cost of data labeling only dropped significantly after the advent of special learning techniques such as dual learning.

Before that, the word "cheap" had never applied to data labeling.

Take the example Lin Hui cited earlier as a reference:
In 2017 of the previous life, annotating 10 million bilingual translation pairs cost about 22 million US dollars.
Note that this covers only bilingual translation.

"Bilingual translation" means annotation for mutual translation between just two languages.

More than 20 million US dollars just to annotate mutual translation between two languages?
Then how much would it cost to cover mutual translation among a hundred languages?
The problem is not complicated; it is a simple combinatorics problem:

C(100, 2) = 4,950; 4,950 × 22 million US dollars = 108.9 billion US dollars.
It is not difficult to see that supporting mutual translation among a hundred languages would push the cost of manually annotating the training set past one hundred billion dollars.
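The count above can be checked with a short sketch. It assumes exactly 100 languages and a flat cost of 22 million US dollars per language pair, as in the text; the constant names are illustrative.

```python
from math import comb

LANGUAGES = 100
COST_PER_PAIR_USD = 22_000_000  # the text's 2017 figure for one language pair

# Number of unordered language pairs: C(100, 2).
pairs = comb(LANGUAGES, 2)

# Total cost if every pair is annotated at the same flat price.
total = pairs * COST_PER_PAIR_USD
print(f"{pairs:,} pairs, {total:,} US dollars")
```

The flat per-pair price is the idealized part of the estimate; as the narration goes on to say, rarer language pairs would cost more.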

And this is only an estimate under ideal conditions. If such labeling were really carried out step by step, the actual cost would be far higher.

After all, mutual translation between many minor languages obviously costs more than mutual translation between mainstream languages.

Of course, in practice no one would really be foolish enough to label mutual translation among a hundred languages pair by pair.

But the estimate still shows that data labeling will remain expensive for quite some time.

By the same reasoning, data labeling in this time and space is still expensive.

And because machine learning research in this time and space lags behind, the cost of data labeling is even higher than in the same period of the previous life.

2. The times are developing rapidly. A scientific calculator that can easily be bought in any stationery shop today beats, in practical efficiency, reliability, and ease of use, the computers of the [-]s and [-]s of the last century, which cost tens of millions of US dollars and occupied hundreds or even thousands of square meters.

That being so, a cheap calculator from the later era, taken back a few decades, would still find a market even at a price of millions of dollars, and might well be quite competitive.

This example does not mean that Lin Hui was going to sell calculators a few decades in the past.

Lin Hui just wanted to use it to show that the wheel of the times keeps rolling forward and technology keeps developing rapidly.

Especially in the post-internet era, it is no exaggeration to say that science and technology change with each passing day.

Under such circumstances, it is normal that a technology not especially valued a few years later can be exchanged for a great deal of wealth a few years earlier.

All the more so when what is being exchanged is data labeling, something that for a long time only the very rich could afford to play with.

In short, Lin Hui saw no problem with the estimate that 10 million pairs of bilingual annotation data would currently cost two to three hundred million US dollars no matter what.

In fact, even "two to three hundred million US dollars" might be a somewhat conservative estimate.

The artificial intelligence industry is structured mainly into an application layer, a technology layer, and a foundation layer.

The application layer covers solutions and products and services.

The technology layer covers applied technology, algorithm theory, and platform frameworks.

The foundation layer covers infrastructure and data.

Seen from this perspective, data can to some extent be regarded as the cornerstone of artificial intelligence.

And that is indeed the case.

The troika of artificial intelligence is algorithms, computing power, and data.

Algorithms seem very important, but in many cases it is difficult to train a high-quality model without high-quality data.

Although data is usually invisible and intangible, no one can ignore its importance.

Labeled data is especially important.

Supervised machine learning is still the main way neural networks are trained today.

Supervised machine learning is inseparable from labeled data.

It requires labeled data as prior experience.

In supervised machine learning, the labeled data is divided proportionally into a training set and a test set.

The machine learns a model from the training set, then makes predictions on the test set to measure the model's accuracy.

Algorithm engineers identify the model's shortcomings from the test results and feed the data problems back to the data labelers, and the process repeats until the model's metrics meet the requirements for going online...
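The split, train, and evaluate cycle described above can be sketched roughly as follows. This is a toy illustration, not any real system: the data set, the 80/20 split ratio, the threshold "model", and all function names are assumptions made for the example.

```python
import random


def split(labeled, test_ratio=0.2, seed=0):
    """Shuffle labeled examples and split them into training and test sets."""
    items = list(labeled)
    random.Random(seed).shuffle(items)
    cut = len(items) - int(len(items) * test_ratio)
    return items[:cut], items[cut:]


def train(train_set):
    """Toy 'model': a threshold halfway between the two class means."""
    pos = [x for x, y in train_set if y == 1]
    neg = [x for x, y in train_set if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def accuracy(threshold, test_set):
    """Fraction of test examples the threshold classifies correctly."""
    correct = sum((x > threshold) == (y == 1) for x, y in test_set)
    return correct / len(test_set)


# Hypothetical labeled data: numbers below 50 are labeled 0, the rest 1.
labeled = [(x, int(x >= 50)) for x in range(100)]
train_set, test_set = split(labeled)
threshold = train(train_set)
acc = accuracy(threshold, test_set)
print(f"threshold={threshold:.1f}, test accuracy={acc:.2f}")
```

In practice the feedback step is the expensive one: whenever the test accuracy falls short, more or better labels have to be produced by hand before the cycle runs again.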

With unsupervised learning hardly applied anywhere today, large-scale, high-quality manually labeled data sets can even be called a rigid necessity for the development of the machine learning industry.

In this case, the importance of data and labeled data cannot be overstated.

That is why Lin Hui said the valuation was conservative.

However, the exact valuation no longer matters much. If it really comes to selling labeled data, the specific price can be negotiated slowly.

Lin Hui needs a lot of money, but if he negotiates with some of the super giants in the future, he may not necessarily ask for money.

Exchanging the data for resources that interest him is not out of the question.

To be honest, some of the resources of these top giants are quite attractive to Lin Hui.

As for the specific labeled data that Lin Hui currently has:

When it came to translation, Lin Hui almost immediately thought of SimpleT, a piece of software on his phone in his previous life.

SimpleT was software developed and tested by Lin Hui's company in his previous life.

It was not well known because it was still in the alpha stage.

The purpose of alpha testing is to evaluate a software product's functionality, localization, usability, reliability, performance, and support, with special attention to its interface and features.

Alpha testing can begin once coding of the product is complete.

It can also begin after module (subsystem) testing is complete.

Or it can begin once testing confirms that the product has reached a certain level of stability and reliability.

SimpleT's alpha test began after the software was confirmed to have reached a certain level of stability and reliability.

So although SimpleT was still in internal testing, the software was technically quite mature, almost only one round of public beta away from official launch.

Lin Hui had originally planned to reproduce this software when the time was right and enter the translation software market.

Now that he had become aware of the special value of labeled data, SimpleT also came to mind almost immediately.

After all, as software focused on AI translation, it naturally used a large amount of bilingual translation annotation data in training.

And SimpleT, though never officially released, was one of the products his former company had placed high hopes on.

Lin Hui was confident that he could find the labeled data actually used in developing this software among the corporate data from his previous life.

In that case, wouldn't it be more effective to directly exchange the labeled data the company had used to tune SimpleT for money?
Although SimpleT did not have inter-translation data labeled for every language when it was built,

at the very least there must be annotations for mutual translation among common languages such as Chinese, English, Russian, French, Spanish, and Japanese.

Even if the data between these languages does not reach the scale of tens of millions of annotated pairs for every language pair,

the Chinese-English and English-Chinese translation annotation data at least must be quite large.

Under such circumstances, Lin Hui estimated that the annotation data used by SimpleT in his previous life would be worth at least seven or eight billion dollars today.

This is undoubtedly a considerable fortune.

Most importantly, even if Lin Hui exchanged the inter-translation annotation data between these languages for money, it would not prevent him from bringing SimpleT to the translation market.

Uh, it does smack a bit of profiteering.

But how to put it, getting several meals out of one chicken is normal.

It can even be said that getting several meals out of one chicken is a typical commercial practice in the Internet age.

Admittedly, Lin Hui is unlikely to get involved in translation annotation any time soon.

But the labeled data in Lin Hui's hands is not limited to the field of translation.

Take natural language processing, the field Lin Hui is working in at this time.

Although Lin Hui mainly relied on unsupervised training, with large amounts of data, when building his earlier generative text summarization model,

he does have labeled data for natural language processing.

And it is very large-scale text annotation data.

This too is a considerable fortune.

The value of this kind of text annotation is certainly lower than that of bilingual inter-translation annotation, which has a higher threshold for annotation.

But at sufficient scale, even ordinary labeled data is wealth that cannot be underestimated.

Lin Hui estimated that exchanging some ordinary labeled data related to text summarization for tens of millions of dollars would be no problem.

If the labeled data is sold as a bundle and he is lucky enough to meet a "connoisseur" (read: a sucker),

and the negotiator is good at bargaining, a deal of nearly [-] million US dollars is also possible.

If the labeled data is dressed up to a certain extent, Lin Hui estimates that talking someone out of hundreds of millions of dollars would be no problem.

What does it mean to dress up labeled data to a certain extent?

It means embellishing the apparent quality of the labeled data.

Strictly speaking, labeled data can be divided into expert-labeled data and crowdsourced data.

So-called "expert labeling" is not actually done by experts.

"Data labeling" sounds impressive, but in reality?
The process is often very tedious, and when a large amount of data is involved it demands a great deal of manual labor.

It may not be low-end, but this kind of mechanical, tedious work certainly has nothing to do with high-end work. Experts and professors would definitely not do it.

So-called expert annotation is generally done part-time by hard-working algorithm engineers.

Or it is done by specialized algorithm data labelers.

The data labeler is a new profession.

In the previous life, with the advent of the era of big data and artificial intelligence, a new profession appeared on the Internet to handle the work of data labeling: the data labeler.

The job of a data labeler is to use corresponding tools to crawl and collect data from the Internet, including text, pictures, voice, etc.

The captured data is then organized and labeled.

The workflow of these data labelers is generally clear:

First, the labelers are trained, and the sample data to be labeled and the labeling rules are determined;

Then, the sample data is labeled according to the agreed rules;

Finally, the labeled results are merged.

Algorithm data labelers differ slightly from general data labelers.

After completing the steps above, an algorithm data labeler often also has to feed the labeled data to a model and debug the model.

Although this is just one extra step in the workflow, professional algorithm data labelers are still rare.

From the tasks listed above, it can also be seen that an algorithm data labeler's job is not just data labeling.

They often also need to evaluate the algorithm model further based on the labeled data.

As a result, algorithm data labelers are often required not only to label data but also to understand the corresponding algorithms.

(End of this chapter)
