Chapter 11 The Ocean of Data and Crawlers

The data analysis module is the brain of Stock God 1.0, but besides this brain, the system needs several other key modules. Since it performs big data analysis, there must be a functional module dedicated to supplying that data, one specifically responsible for data collection.

This data collection module is like the Stock God's hands and feet: it gathers the required data from the network, and its key component is the crawler.

The Internet has by now grown into a vast and complex network system with billions of information nodes: desktops, laptops, servers, large clusters, smartphones, tablets, intelligent navigation terminals, all kinds of data collection terminals, information transmitters, and so on.

Anything that can connect to the Internet and exchange information with it counts as an information node; strictly speaking, the surveillance cameras and communication base stations scattered throughout a city are nodes too.

The data these nodes contribute to the Internet is equally varied: text, numerical data, charts, documents, video, audio, and databases, expressed in everything from common formats to specialized ones. Together, all of this information forms a vast ocean of data deposited on the Internet.

This ocean of data is dynamic; like currents and waves, it is always in motion, constantly updating, never still.

The entire ocean of data is scattered across countless information nodes, linked by various communication protocols so that they can talk to one another. Riding on these protocols is something we are all familiar with: the URL, the website link we encounter every day.

If the ocean of data is compared to the Earth, then each data node is a room, the data inside it is the people living there, and all the nodes together form the countless cities, buildings, and houses of the world.

The various ways of linking data are the roads that carry people about, and URLs are one kind of road: the railways and highways that connect cities and major commercial buildings. They appear mainly on public servers, which means that anything with a URL is, in theory, open to any visitor; anyone can reach the server, though whether they are granted access is another matter.

Where there is public space there must also be private space. Beyond URLs there are many other linking schemes, and under them an information node is more like a private residence or a restricted military zone: it exists in the ocean of data, but it is not open to the public and cannot be visited casually.

Facing such a vast ocean of data raises a problem: the world is so big, how do I find my target? Say I want information about cold medicine; what do I do?

It is this need that gives rise to search engines, which help you find your target quickly. A search engine is like a wayfinding guide: tell it where you want to go and what the destination looks like, and it returns countless possible destinations along with their URLs.

A search engine handles a massive number of visits every day, with hundreds of thousands of concurrent search requests per second. Searching the live Internet for each of those requests is plainly unrealistic: not only would it be slow and inefficient, the search traffic alone would be enough to congest the entire global Internet.

To solve this problem, search engines work in their own way: first they gather as much information as possible from the ocean of data and store it in their own server clusters; then, when a search request arrives, they only need to look it up on their own servers.
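To make this working mode concrete, here is a minimal, purely illustrative Python sketch: pages are gathered ahead of time into a local index, and a search then touches only that index, never the live Internet. The sample pages and the `build_index`/`search` names are invented for illustration, not anything from an actual search engine.

```python
from collections import defaultdict

def build_index(pages):
    """Build a tiny inverted index: word -> set of URLs containing that word.
    `pages` maps a URL to the text previously collected from it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Answer a query from the local index only; no network access is needed."""
    words = query.lower().split()
    if not words:
        return set()
    results = index.get(words[0], set()).copy()
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Usage: data gathered earlier by a crawler, searched locally.
pages = {
    "http://example.com/a": "cold medicine dosage and side effects",
    "http://example.com/b": "stock market daily candlestick summary",
}
index = build_index(pages)
print(search(index, "cold medicine"))   # {'http://example.com/a'}
```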

And it is the crawler that does this gathering for the search engine.

Because the Internet's information nodes are interlinked in a mesh, each node carries many URLs. The crawler therefore works by traversal: it starts from one information node, visits every node linked from it one by one, and whenever a newly reached node contains further URLs, it follows those as well, until it has traversed every URL once.

Because the whole Internet is a connected mesh, by the time the crawler has traversed all those URLs it has, generally speaking, visited every link on the entire Internet, a journey far more breathtaking than traveling around the world.
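As a rough illustration of this traversal, here is a minimal Python sketch: starting from a seed URL, the crawler repeatedly takes a page, reads the URLs on it, and queues any it has not seen before. The tiny in-memory link graph stands in for real pages and link extraction; it is an assumption made purely so the example can run on its own.

```python
from collections import deque

# A made-up link graph standing in for real pages and their outgoing URLs.
LINKS = {
    "http://a.example": ["http://b.example", "http://c.example"],
    "http://b.example": ["http://a.example", "http://d.example"],
    "http://c.example": [],
    "http://d.example": ["http://b.example"],
}

def crawl(seed_url):
    """Breadth-first traversal: start at one node, follow every URL exactly once."""
    seen = {seed_url}          # URLs already discovered
    queue = deque([seed_url])  # URLs waiting to be visited
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in LINKS.get(url, []):
            if link not in seen:   # never revisit a URL
                seen.add(link)
                queue.append(link)
    return order

print(crawl("http://a.example"))   # visits a, b, c, d once each
```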

Since Mo Hui wants to build Stock God 1.0, he needs to collect massive amounts of data, which makes his task very similar to a search engine's, except that a search engine collects everything, while Mo Hui only cares about information related to stocks.

So Mo Hui's crawler must have both the ability to traverse and the ability to filter.

Traversal is simple to state: the crawler must not retrace its steps, and it must not walk the same URL twice. Whenever a new URL is discovered, it has to determine whether that URL has already been visited, and it also has to decide the order in which pending URLs are processed. The first is a deduplication problem, the second an optimization problem, and both call for a dedicated traversal algorithm.
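A minimal sketch of how those two problems are commonly handled, assuming a hash set for the deduplication check and a priority queue for the ordering; the `priority` scoring below is an arbitrary placeholder, not the dedicated traversal algorithm the text refers to.

```python
import heapq

def priority(url):
    """Placeholder scoring: prefer shorter URLs. A real crawler would rank
    by freshness, link importance, expected relevance, and so on."""
    return len(url)

class Frontier:
    """Queue of URLs to visit: deduplicated, always yielding the best next URL."""
    def __init__(self):
        self._seen = set()   # deduplication problem: never admit a URL twice
        self._heap = []      # optimization problem: pop URLs in priority order

    def add(self, url):
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (priority(url), url))

    def pop(self):
        return heapq.heappop(self._heap)[1] if self._heap else None

frontier = Frontier()
for u in ["http://finance.example/quotes", "http://a.example", "http://a.example"]:
    frontier.add(u)          # the duplicate is silently dropped
print(frontier.pop())        # the shortest URL comes out first
```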

Filtering is the main difference between a general crawler and a dedicated one. The crawler needs a certain recognition ability: it must be able to judge whether the content behind a URL is relevant, skip it if not, and copy it back for use if it is.

This filtering takes a great many algorithms to solve, and on top of that it requires natural language processing: the ability to understand and parse language and text, so it can tell which content is relevant to stocks and which is useless.
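Genuine relevance judgment needs real natural language processing, but the skeleton of the filtering step can be sketched with something as crude as keyword matching; the keyword list and threshold below are invented for illustration only.

```python
# Hypothetical keyword list; a real system would use trained NLP models instead.
STOCK_KEYWORDS = {"stock", "share", "candlestick", "dividend", "exchange", "index"}

def is_stock_related(text, threshold=2):
    """Crude relevance check: keep the page if enough stock keywords appear."""
    words = set(text.lower().split())
    return len(words & STOCK_KEYWORDS) >= threshold

def filter_pages(pages):
    """Keep only content judged relevant; skip everything else."""
    return {url: text for url, text in pages.items() if is_stock_related(text)}

pages = {
    "http://x.example/1": "the stock exchange index rose as dividend season began",
    "http://x.example/2": "my favourite cold medicine and chicken soup recipe",
}
print(list(filter_pages(pages)))   # only the stock-related page survives
```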

Recognizing text is not enough; the crawler also has to handle data in other formats, such as images. It must be able to tell stock-related candlestick charts, bar charts, and the like apart from landscape photos or selfies.

Beyond pictures there are videos, audio, various databases, and more, and the crawler must examine each of them to decide whether it counts as relevant content.

There are a myriad of technical problems to solve, and it would be almost unthinkable for Mo Hui to tackle them all alone.