.. _data:

==================================
Data Layer: Data Framework & Usage
==================================

Introduction
============

``Data Layer`` provides user-friendly APIs to manage and retrieve data, backed by a high-performance data infrastructure.

It is designed for quantitative investment. For example, users can easily build formulaic alphas with ``Data Layer``. Please refer to `Building Formulaic Alphas <../advanced/alpha.html>`_ for more details.

The introduction of ``Data Layer`` includes the following parts.

- Data Preparation
- Data API
- Data Loader
- Data Handler
- Dataset
- Cache
- Data and Cache File Structure

Here is a typical example of the Qlib data workflow:

- Users download data and convert it into Qlib format (with the filename suffix `.bin`). In this step, typically only some basic data (such as OHLCV) is stored on disk.
- Basic features are created with Qlib's expression engine (e.g. "Ref($close, 60) / $close", the ratio of the close price 60 trading days ago to the current close price, which reflects the return over the last 60 trading days). Supported operators in the expression engine can be found `here `__. This step is typically implemented in Qlib's `Data Loader `_, which is a component of `Data Handler `_.
- If users require more complicated data processing (e.g. data normalization), `Data Handler `_ supports user-customized processors to process data (some predefined processors can be found `here `__). Processors are different from operators in the expression engine: they are designed for complicated data processing methods that are hard to support with operators.
- Finally, `Dataset `_ is responsible for preparing a model-specific dataset from the processed data of Data Handler.

Data Preparation
================

Qlib Format Data
----------------

We've specially designed a data structure to manage financial data. Please refer to the `File storage design section in Qlib paper `_ for detailed information.
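
As a rough illustration of this storage design, the sketch below writes a small array to disk and reads it back with numpy. It assumes a layout that is not spelled out in this section: a flat sequence of little-endian ``float32`` values whose first element is the start index into the trading calendar, followed by the feature values. The file name ``close.day.bin``, the start index, and the sample values are hypothetical; check the Qlib source before relying on this layout.

.. code-block:: python

    import os
    import tempfile

    import numpy as np

    # Assumed layout (an assumption, not confirmed by this section):
    # little-endian float32 values, where the first element is the start
    # index into the trading calendar and the rest are feature values.
    start_index = 3  # hypothetical calendar offset
    close_values = np.array([1.0, 1.02, 0.99, 1.05], dtype="<f4")  # hypothetical data

    with tempfile.TemporaryDirectory() as tmp_dir:
        bin_path = os.path.join(tmp_dir, "close.day.bin")  # hypothetical file name

        # Write the start index followed by the values, all as float32.
        np.hstack([[start_index], close_values]).astype("<f4").tofile(bin_path)

        # Read the whole file back as a flat float32 array.
        raw = np.fromfile(bin_path, dtype="<f4")

    # Split the assumed header value from the feature values.
    loaded_start = int(raw[0])
    loaded_values = raw[1:]

    print("start index:", loaded_start)
    print("values:", loaded_values)

Reading a real feature file the same way and comparing a few values against another source is one simple way to validate downloaded data.
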
Data stored in this structure uses the filename suffix `.bin` (we call these `.bin` files, the `.bin` format, or the qlib format). The `.bin` file is designed for scientific computing on finance data.

``Qlib`` provides two different off-the-shelf datasets, which can be accessed through this `link `__:

========================  =================  ================
Dataset                   US Market          China Market
========================  =================  ================
Alpha360                  √                  √
Alpha158                  √                  √
========================  =================  ================

Also, ``Qlib`` provides a high-frequency dataset. Users can run a high-frequency dataset example through this `link `__.

Qlib Format Dataset
-------------------

``Qlib`` provides an off-the-shelf dataset in `.bin` format. Users can download the China-Stock dataset with the script ``scripts/get_data.py``, using the commands shown after the notes on price adjustment. Users can also use numpy to load a `.bin` file and validate the data.

The price and volume data look different from the actual dealing prices because they are **adjusted** (`adjusted price `_). The adjusted prices may also differ across data sources, because data sources vary in how they adjust prices. When adjusting prices, Qlib normalizes the price on the first trading day of each stock to 1. Users can use `$factor` to recover the original trading price (e.g. `$close / $factor` gives the original close price).

Here are some discussions about price adjustment in Qlib.

- https://github.com/microsoft/qlib/issues/991#issuecomment-1075252402

.. code-block:: bash

    # download 1d
    python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/cn_data --region cn

    # download 1min
    python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/qlib_cn_1min --region cn --interval 1min

In addition to China-Stock data, ``Qlib`` also includes a US-Stock dataset, which can be downloaded with the following command:

.. code-block:: bash

    python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/us_data --region us

After running the above commands, users can find the China-Stock and US-Stock data in ``Qlib`` format in the ``~/.qlib/qlib_data/cn_data`` and ``~/.qlib/qlib_data/us_data`` directories respectively.

``Qlib`` also provides scripts in ``scripts/data_collector`` to help users crawl the latest data from the Internet and convert it to qlib format.

Once ``Qlib`` is initialized with this dataset, users can build and evaluate their own models with it. Please refer to `Initialization <../start/initialization.html>`_ for more details.

Automatic update of daily frequency data
----------------------------------------

**It is recommended that users update the data manually once (\-\-trading_date 2021-05-25) and then set it to update automatically.**

For more information refer to: `yahoo collector `_

- Automatic update of data to the "qlib" directory each trading day (Linux)

  - use *crontab*: `crontab -e`
  - set up timed tasks:

    .. code-block:: bash

        * * * * 1-5 python