Cyber-Lenin Improvement Log (260228)
Before writing a diary entry, the Lenin bot now stores the latest news it has searched for in a knowledge graph, reinforcing its long-term memory...
The concepts themselves are not difficult: a knowledge graph represents objects as nodes and the relationships or actions between them as edges, and GraphRAG uses that structure to query more complex relational connections than simple text search can...
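To make the node/edge idea concrete, here is a minimal sketch of a graph stored as (subject, relation, object) triples and a multi-hop query over it. The triples and the names `build_graph` and `find_path` are hypothetical illustrations, not the bot's actual data or code, and a real GraphRAG setup would sit on a graph database rather than a dict.

```python
from collections import defaultdict, deque

# Hypothetical example triples: (subject, relation, object).
TRIPLES = [
    ("Country A", "exports", "rare earths"),
    ("rare earths", "used_in", "chip fabrication"),
    ("chip fabrication", "supplies", "Company B"),
]


def build_graph(triples):
    """Index triples as node -> list of (relation, neighbor) edges."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return graph


def find_path(graph, start, goal):
    """Breadth-first search; returns the list of hops from start to goal, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return None


graph = build_graph(TRIPLES)
for hop in find_path(graph, "Country A", "Company B"):
    print(hop)
```

No single sentence in the source data needs to mention both "Country A" and "Company B" for the link to be found, which is the kind of relational query plain text search misses.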
However, unless a person enters the values manually, an LLM interprets the raw data to extract the nodes and edges, so many of the extracted values are too unrefined to use right away. Terminology is also poorly unified (e.g., some edges are named 'export', others 'sell'), so the gaps have to be covered with embedding similarity or by making the categories hierarchical.
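Both workarounds for the 'export' vs 'sell' problem can be sketched roughly as follows. The vectors in `TOY_EMBEDDINGS` are hand-written stand-ins for what a real embedding model would return, and the `PARENT` hierarchy, the 0.9 threshold, and the function names are all assumptions made up for illustration.

```python
import math

# Hypothetical toy vectors standing in for real embedding-model output.
TOY_EMBEDDINGS = {
    "export": [0.9, 0.1, 0.0],
    "sell": [0.8, 0.2, 0.1],
    "attack": [0.0, 0.1, 0.9],
}

# Hypothetical hierarchy: fine-grained edge label -> parent category.
PARENT = {"export": "trade", "sell": "trade", "attack": "conflict"}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def canonical_label(label, canonical, embeddings, threshold=0.9):
    """Return the most similar canonical label at or above the threshold, else the label itself."""
    best, best_score = label, threshold
    for candidate in canonical:
        score = cosine(embeddings[label], embeddings[candidate])
        if score >= best_score:
            best, best_score = candidate, score
    return best


def category(label):
    """Fall back to the label itself when it has no parent category."""
    return PARENT.get(label, label)


print(canonical_label("sell", ["export"], TOY_EMBEDDINGS))
print(canonical_label("attack", ["export"], TOY_EMBEDDINGS))
print(category("sell") == category("export"))
```

Similarity matching merges near-synonyms without a hand-built vocabulary but depends on a threshold that has to be tuned, while the hierarchy is predictable but must be maintained by hand; the two can be combined.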
And actually, more than the structure itself, what matters is the experience accumulated by grinding through the data collection, refinement, and analysis pipeline. It will take quite some time to catch up with a company like Palantir, which has been analyzing U.S. military and confidential government data since 2003, giving it over 20 years of real-world practice. The more data a nation owns, the easier such a system is to build, and the countries that meet that condition are China (which lacks war experience but has an enormous amount of domestically collected data) and Israel. Europe has relied on Palantir service contracts until now and only recently started building sovereign AI... Japan and Korea also want to reduce their dependence on the US and are building their own sovereign AI, but the feasibility is questionable.
For reference, corporate data sales are legal in the US, so the government has long been collecting various kinds of personal data. As Edward Snowden revealed, it has also used illegal acquisition methods such as wiretapping. There was a lot of data, but much of it sat unprocessed because such volumes were hard to handle; with the advancement of AI, mass processing has become possible.