task-log-6118d51c44d2f6d2ffffffffffffffffffffffff01000000
database_utils:__init__:42
- MongoDB connection successful with URI:
mongodb+srv://sentineldbuser:[email protected]/
2025-02-13 23:12:33.133 | INFO |
src.database_utils.database_utils:create_mongo_index:128 - Index status_batch_id
already exists on recipeactions.
2025-02-13 23:12:33.135 | INFO |
src.database_utils.database_utils:create_mongo_index:128 - Index status already
exists on recipeactions.
2025-02-13 23:12:33.137 | INFO |
src.database_utils.database_utils:create_mongo_index:128 - Index _id_status already
exists on recipeactionruns.
2025-02-13 23:12:33.262 | INFO | src.aws.s3_helper:_initialize_s3_client:97 -
Using EC2 IAM Role for S3 operations in region us-east-1
2025-02-13 23:12:33.262 |INFO | ray:process_item_ray:909 | data_prepro: {'ocr':
{'enabled': True, 'method': 'textract', 'force_recreate': True, 'extract_images':
True, 'extract_tables': False, 'extract_layouts': False, 'extract_forms': False,
'credentials': {'type': 'aws', 'properties': {'aws_credential_type': 'arn_role',
'aws_region': 'us-east-1', 'aws_external_id': '679cd3091fed3f5d66e4aeef',
'aws_iam_role_arn': 'arn:aws:iam::120569633920:role/karini-legal-role'}}},
'pii_masking': {'enabled': False, 'entities': {}, 'force_recreate': True,
'credentials': {'type': 'aws', 'properties': {'aws_credential_type': 'arn_role',
'aws_region': 'us-east-1', 'aws_external_id': '679cd3091fed3f5d66e4aeef',
'aws_iam_role_arn': 'arn:aws:iam::120569633920:role/karini-legal-role'}}},
'chunking': {'type': 'recursive', 'tokenizer': 'cl100k_base', 'overlap': 50,
'size': 525, 'force_recreate': True}, 'preprocessing_setting':
{'custom_lambda_preprocessor': {}, 'custom_metadata_extraction': {}}}
2025-02-13 23:12:33.393 | INFO |
src.services.components.preprocessing.preprocessor:__init__:85 - Using Assumed role
arn:aws:iam::120569633920:role/karini-legal-role with external id
679cd3091fed3f5d66e4aeef for AWS Textract
2025-02-13 23:12:33.395 | INFO |
src.services.components.chunking.chunking:__init__:48 - Chunking event: {'dataset':
{'dataset_id': '67aeec0db172dd7f07d39470', 'dataset_type': 'text',
'dataset_sources': [{'id': 'FLW_1', 'recursive': True, 'connector_type': 'aws',
'dataset_connector_id': '67aeec17b172dd7f07d394cc', 'credentials': {'type': 'aws',
'properties': {'aws_credential_type': 'arn_role', 'aws_region': 'us-east-1',
'aws_external_id': '679cd3091fed3f5d66e4aeef', 'aws_iam_role_arn':
'arn:aws:iam::120569633920:role/karini-legal-role'}}, 'path':
's3://karini-legal-v2-docs/sample/', 'filters': {'filter': []}, 'source_type': 's3'}],
'preprocessing': {'ocr': {'enabled': True, 'method': 'textract', 'force_recreate':
True, 'extract_images': True, 'extract_tables': False, 'extract_layouts': False,
'extract_forms': False, 'credentials': {'type': 'aws', 'properties':
{'aws_credential_type': 'arn_role', 'aws_region': 'us-east-1', 'aws_external_id':
'679cd3091fed3f5d66e4aeef', 'aws_iam_role_arn':
'arn:aws:iam::120569633920:role/karini-legal-role'}}}, 'pii_masking': {'enabled':
False, 'entities': {}, 'force_recreate': True, 'credentials': {'type': 'aws',
'properties': {'aws_credential_type': 'arn_role', 'aws_region': 'us-east-1',
'aws_external_id': '679cd3091fed3f5d66e4aeef', 'aws_iam_role_arn':
'arn:aws:iam::120569633920:role/karini-legal-role'}}}, 'chunking': {'type':
'recursive', 'tokenizer': 'cl100k_base', 'overlap': 50, 'size': 525,
'force_recreate': True}, 'preprocessing_setting': {'custom_lambda_preprocessor':
{}, 'custom_metadata_extraction': {}}}, 'force_preprocessing': False, 'embeddings':
{'credentials': {'type': 'aws', 'properties': {'aws_credential_type': 'arn_role',
'aws_region': 'us-east-1', 'aws_external_id': '679cd3091fed3f5d66e4aeef',
'aws_iam_role_arn': 'arn:aws:iam::120569633920:role/karini-legal-role'}},
'dimension': 1024, 'modelid': 'amazon.titan-embed-text-v2:0', 'modelprovider':
'amazon-bedrock', 'endpoint_id': '67a0892abb1320ca3b3a2c37', 'force_recreate':
True, 'parameters': {'modelprovider': 'amazon-bedrock', 'tokenizer': 'cl100k_base',
'dimension': 1024, 'max_tokens': 8000, 'pricing': {'input': {'tokens': 1000,
'currency': '$', 'value': 2e-05}}, 'credentials': {'type': 'aws', 'enabled': False,
'aws': {'credential_type': 'arn_role'}}, 'modelid':
'amazon.titan-embed-text-v2:0'}}, 'use_local_s3_storage': False}}
2025-02-13 23:12:33.395 | INFO |
src.services.components.chunking.chunking:__init__:51 - Chunking properties:
{'type': 'recursive', 'tokenizer': 'cl100k_base', 'overlap': 50, 'size': 525,
'force_recreate': True}
2025-02-13 23:12:33.890 | INFO |
src.services.components.connectors:get_data_connector:60 - Initializing connector
for type: aws
2025-02-13 23:12:34.007 | INFO | src.aws.s3_helper:_initialize_s3_client:76 -
Assumed role arn:aws:iam::120569633920:role/karini-legal-role for region us-east-1
2025-02-13 23:12:34.008 |INFO | ray:process_item_ray:944 | sourceref:
s3://karini-legal-v2-docs/sample/BHC/2005/BHC_2005_Mr._Mangesh_Govind_Patane_vs_Mr._Nagesh_Vasant_Kadam___Others_2005_BHC-AS_6620.pdf
2025-02-13 23:12:36.054 | INFO |
src.database_utils.database_utils:update_dataset_items:650 - Matched documents:1
2025-02-13 23:12:36.054 | INFO |
src.database_utils.database_utils:update_dataset_items:651 - Modified documents:1
2025-02-13 23:12:36.054 | INFO |
src.services.pipelines.data_ingestion_ray:get_processed_data_standalone:640 - ---
Processing data using OCR options ---:textract
2025-02-13 23:12:36.054 | INFO |
src.services.components.preprocessing.preprocessor:process:445 - Using Textract for
OCR Processing (without table extraction)
2025-02-13 23:12:36.054 | INFO |
src.services.components.preprocessing.preprocessor:extract_text_with_textract:179 -
Extracting text from pdf page images using Textract
2025-02-13 23:12:36.054 | INFO |
src.services.components.preprocessing.preprocessor:pdf_to_images:157 - Processing
PDF with 14 pages
2025-02-13 23:12:39.152 | INFO |
src.services.components.preprocessing.preprocessor:extract_text_from_image:190 -
Got 200 Textract response
2025-02-13 23:12:47.470 | INFO |
src.services.components.preprocessing.preprocessor:extract_text_from_image:190 -
Got 200 Textract response
2025-02-13 23:12:51.466 | INFO |
src.services.components.preprocessing.preprocessor:extract_text_from_image:190 -
Got 200 Textract response
2025-02-13 23:12:53.001 | INFO |
src.services.components.preprocessing.preprocessor:extract_text_from_image:190 -
Got 200 Textract response
2025-02-13 23:12:56.539 | INFO |
src.services.components.preprocessing.preprocessor:extract_text_from_image:190 -
Got 200 Textract response
2025-02-13 23:13:21.549 | INFO |
src.services.components.preprocessing.preprocessor:extract_text_from_image:190 -
Got 200 Textract response
2025-02-13 23:13:23.271 | INFO |
src.services.components.preprocessing.preprocessor:extract_text_from_image:190 -
Got 200 Textract response
2025-02-13 23:13:33.366 | INFO |
src.services.components.preprocessing.preprocessor:extract_text_from_image:190 -
Got 200 Textract response
2025-02-13 23:13:41.177 | ERROR |
src.services.components.preprocessing.preprocessor:extract_text_from_image:199 -
Error in extract_text_from_image: An error occurred
(ProvisionedThroughputExceededException) when calling the DetectDocumentText
operation: Provisioned rate exceeded
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/workers/default_worker.py", line 297, in <module>
worker.main_loop()
│ └ <function Worker.main_loop at 0x7f65c7c458a0>
└ <ray._private.worker.Worker object at 0x7f65c7e1e840>
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_private/worker.py",
line 935, in main_loop
self.core_worker.run_task_loop()
│ │ └ <method 'run_task_loop' of 'ray._raylet.CoreWorker' objects>
│ └ <ray._raylet.CoreWorker object at 0x7f65c7c70f40>
└ <ray._private.worker.Worker object at 0x7f65c7e1e840>