Chapter 12 Treat yourself like a donkey
Of course, it is not enough to have an analysis module and a collection module, it also needs to have a data processing module, if the analysis module is the brain, the collection module is the hands and feet, then the processing module is the digestive system. Pen "Fun" Pavilion www.biquge.info
When a large amount of data information is extracted from the data ocean, it is necessary to process this information and process it into a data model that can be used by the data analysis module.
For example, an annual financial report of a listed company contains a lot of content, from personnel changes to corporate strategy, mergers and acquisitions, profits and revenues, etc., and these things are key information. A financial report of tens of thousands of words plus various charts, in which various key information must be able to be understood and processed by the analysis module, this is the main job of the processing module.
The main functional part of this data processing module is actually natural language processing, and the program itself cannot understand the language connotation, and it certainly cannot understand what it means to "private placement of 1 million restricted shares".
For example, first of all, the meaning is divided, the orientation is set as a meaning unit, and the additional issuance is set as another unit, so that the whole sentence is divided according to the meaning unit and assigned separately.
This set of processing methods involves the language processing problem of human-computer interaction, which requires artificial help for computers to understand and process human language, so that machines can understand grammar and semantic units, be able to connect with contexts, and be able to deal with different meanings of the same phrase in different contexts.
To put it simply, enabling machines to understand human language is the main goal of natural language and the main function of this processing module.
Analysis, collection, and processing, these three modules are the main functional structure of Gushen 1.0, but this is not enough, and Gushen also needs a lot of auxiliary modules.
For example, it needs to have a storage module, all the data and information are collected, must be sorted and processed, and then classified and stored, it is like a super library, it must have its own classification and storage rules. If you don't have that, you can simply pile them on top of each other, and you can imagine the painful and terrifying process of finding a particular page from tens of millions of books.
In addition, the god of shares also needs the corresponding display and interaction module, as a software, it needs to have its own operation interface, it needs to be able to display the processing results or process, and it needs to be able to receive instructions and carry out human-computer interaction.
These five modules are combined together, and can smoothly cooperate with each other, the stock god system can be regarded as basically formed, and there will definitely be all kinds of problems in the middle, which need to be solved one by one. In the process of use, it must also involve constant updating and improvement, all of which will be Mo Hui's work.
According to Mo Hui's estimate, it is unlikely that the volume of the entire stock god will be less than 1 million lines of code, and if you want to make the stock god as perfect and accurate as possible, then its volume will definitely turn over and up. If you want to achieve something, you have to pay a price, and if you want to make your predictions as accurate as possible, then you must keep pouring in.
This is just the god itself, if you want the god to operate, then Mo Hui will inevitably face the problem of bandwidth, once the crawler runs, massive data will be sent back, and these data are at least T-level.
In the computer world, the unit of data size is 1024 base, a byte is a byte, 1024 bytes is kb, 1024k is m, 1024m is g, and 1024g is t.
For example, the storage capacity of our mobile phone may be 4G, the storage capacity of a laptop may be 400G, and the 400G of a notebook is equivalent to a thousand movies.
And the data collected by the stock god through the crawler must be massive, at least T-level, even if it runs to the P level, it is not a big deal. For example, 1P data is roughly equivalent to 2.5 million movies. A person's life is only 30,000 days, and watching ten movies a day is enough to watch ten lifetimes.
In the face of such a large amount of data, Mo Hui will inevitably face a bandwidth problem, and it is easy to imagine that the community broadband in the rental house must be difficult to use.
Now that the computing power of the ultrabook has been verified, it should be relatively extraordinary, but its storage capacity has not been tested, if the storage capacity is not successful, Mo Hui must also find a storage space for this massive amount of data.
There are many problems like this, if Mo Hui wants to complete the stock god and put it up and running, then he must be like an old scalper, go forward diligently, and dispose of all these roadblocks one by one.
Originally, these things were handed over to a company to deal with, and a mature team to deal with them may not be able to handle them well, but now Mo Hui needs to be handled by one person, and it is likely that it must be done by a person who is not obvious, and the difficulty in this can be imagined.
Thinking about the future ahead, Mo Hui feels like climbing Mount Everest, so high~~~
Fortunately, Mo Hui is more or less an industry insider, and these things can basically be regarded as their own work, which is nothing more than the project manager, product manager, main process, and architecture. It's a little harder, and the workload is a little bigger, but there is a solution, as long as you follow the road step by step, there will always be a day when it will be completed.
It's a lot of work, but it's not without shortcuts, so Mo clicks on the webpage and starts collecting the open source software he needs. He went to the open source house to search, and there were more than 100 open source crawlers, and it was estimated that there would be suitable ones.
He simply searched for the five modules, and most of them have similar alternative software, and now all he needs to do is to find the most suitable one in it, and then modify it and assemble it.
First of all, you need to choose the development language, each language has its own scope of application and advantages and disadvantages, once selected, then the five modules need to be developed in the same language, which is also convenient for assembly and expansion development.
Mo thought about it, and he finally settled on C++ because this language is closer to the bottom layer and assembly, and the overall execution efficiency and speed are better.
Mo Hui began to search for and screen open source software on the Internet, and downloaded all the software developed by C++ that basically met the requirements, and classified and stored them first.
When Mo Hui downloaded all the more than 30 kinds of reptiles that were used, it was already late at night, Mo Hui rubbed his stiff neck, stretched his waist, and couldn't help but lament for the days to come: I have to work hard, I have to go home desperately, this is to call myself as a donkey~~~