Azure AI Studio indexing seems to be broken now

Steven Parry 0 Reputation points
2024-09-16T18:44:12.14+00:00

Within Azure AI Studio, in the Indexes section, I'm creating a new index. The payload is a set of webpages. The "Add vector search to this search resource" option is enabled.

After a few minutes I receive an error.

It would seem that something has broken between the older 0.0.38 version and the new 0.0.42 version of llm_rag_crack_and_chunk_and_embed.

I attempted to create a new index using the exact same payload with the new 0.0.42 version, and it fails. When I then clone the older 0.0.38 job and run it again, it works. These are the errors from the log:

[2024-09-16 17:11:21] INFO     azureml.rag.crack_and_chunk - Processing file: www.surreyilc.org.uk.html (crack_and_chunk.py:127)
[2024-09-16 17:11:22] ERROR    azureml.rag.crack_and_chunk_and_embed.create_embeddings - ActivityCompleted: Activity=create_embeddings, HowEnded=Failure, Duration=252844.63 [ms], Exception=AttributeError (activity.py:127)
[2024-09-16 17:11:22] ERROR    azureml.rag.crack_and_chunk_and_embed.crack_and_chunk_and_embed - ServiceError: intepreted error = Rag system error, original error = 'lxml.etree._ProcessingInstruction' object has no attribute 'is_phrasing' (exceptions.py:124)
[2024-09-16 17:11:27] ERROR    azureml.rag.crack_and_chunk_and_embed.crack_and_chunk_and_embed - crack_and_chunk failed with exception: Traceback (most recent call last):
  File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/tasks/crack_and_chunk_and_embed.py", line 506, in main_wrapper
    map_exceptions(main, activity_logger, args, logger, activity_logger)
  File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/utils/exceptions.py", line 126, in map_exceptions
    raise e
  File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/utils/exceptions.py", line 118, in map_exceptions
    return func(*func_args, **kwargs)
  File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/tasks/crack_and_chunk_and_embed.py", line 475, in main
    embeddings_container = crack_and_chunk_and_embed(
  File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/tasks/crack_and_chunk_and_embed.py", line 344, in crack_and_chunk_and_embed
    num_embedded = create_embeddings(
  File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/tasks/embed.py", line 312, in create_embeddings
    for chunk in chunks:
  File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/tasks/crack_and_chunk_and_embed.py", line 218, in documents_to_embed
    for chunked_doc in chunked_docs:
  File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/documents/chunking.py", line 169, in split_documents
    for i, document in enumerate(documents):
  File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/documents/cracking.py", line 376, in crack_documents
    raise e
  File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/documents/cracking.py", line 365, in crack_documents
    yield loader.load_chunked_document()
  File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/documents/cracking.py", line 71, in load_chunked_document
    pages = self.load()
  File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/documents/cracking.py", line 132, in load
    docs = super().load()
  File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/langchain/vendor/document_loaders/unstructured.py", line 79, in load
    elements = self._get_elements()
  File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/documents/cracking.py", line 148, in _get_elements
    return partition_html(file=self.file, **self.unstructured_kwargs)
  File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/unstructured/documents/elements.py", line 605, in wrapper
    elements = func(*args, **kwargs)
  File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/unstructured/file_utils/filetype.py", line 706, in wrapper
    elements = func(*args, **kwargs)
  File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/unstructured/file_utils/filetype.py", line 662, in wrapper
    elements = func(*args, **kwargs)
  File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
  File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/unstructured/partition/html/partition.py", line 103, in partition_html
    elements = list(
  File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/unstructured/partition/lang.py", line 475, in apply_lang_metadata
    elements = list(elements)
  File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/unstructured/partition/html/partition.py", line 222, in iter_elements
    yield from cls(opts)._iter_elements()
  File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/unstructured/partition/html/partition.py", line 229, in _iter_elements
    for e in self._main.iter_elements():
  File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/unstructured/partition/html/parser.py", line 361, in iter_elements
    yield from self._element_from_text_or_tail(block_item.tail or "", q)
  File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/unstructured/partition/html/parser.py", line 377, in _element_from_text_or_tail
    for node in self._iter_text_segments(text, q):
  File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/unstructured/partition/html/parser.py", line 421, in _iter_text_segments
    while q and q[0].is_phrasing:
AttributeError: 'lxml.etree._ProcessingInstruction' object has no attribute 'is_phrasing'
 (crack_and_chunk_and_embed.py:508)
[2024-09-16 17:11:27] ERROR    azureml.rag.crack_and_chunk_and_embed.crack_and_chunk_and_embed - ActivityCompleted: Activity=crack_and_chunk_and_embed, HowEnded=Failure, Duration=259423.54 [ms], Exception=AttributeError (activity.py:127)
Traceback (most recent call last):
  File "/azureml-envs/rag-embeddings/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/azureml-envs/rag-embeddings/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/tasks/crack_and_chunk_and_embed.py", line 559, in 

1 answer

  1. Sina Salam 10,176 Reputation points
    2024-09-16T19:40:50.35+00:00

    Hello Steven Parry,

    Welcome to Microsoft Q&A, and thank you for posting your question here.

    I understand that your indexing is broken in Azure AI Studio.

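    The error summary shows an AttributeError: an 'lxml.etree._ProcessingInstruction' object has no attribute 'is_phrasing'. According to the traceback, it is raised in the unstructured HTML parser (unstructured/partition/html/parser.py) while the 0.0.42 version of the llm_rag_crack_and_chunk_and_embed component processes your HTML files.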

    There might be a change or a new requirement in the 0.0.42 version. You can:

    • Implement additional error handling around the crack_and_chunk_and_embed function in your logic to catch and log more details about the error.
    • Continue using the previous 0.0.38 version until the issue with the 0.0.42 version is resolved.
    • If the issue persists, reach out to Azure support with the detailed error logs.
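
One way to gather more detail before contacting support is to check whether the source pages contain processing instructions (for example `<?xml ... ?>`), since the traceback fails on an `lxml.etree._ProcessingInstruction` node. The following is a minimal sketch using only the Python standard library. The helper names `find_processing_instructions` and `strip_processing_instructions` are hypothetical and not part of azureml.rag or unstructured. It also assumes that the standard-library parser and lxml agree on what counts as a processing instruction, which may not hold for every page, and nothing in the log confirms that stripping these nodes makes the 0.0.42 job succeed.

```python
import re
from html.parser import HTMLParser


class ProcessingInstructionFinder(HTMLParser):
    """Collects every processing instruction (e.g. <?xml ...?>) seen in a page."""

    def __init__(self):
        super().__init__()
        self.found = []  # (line, column, text) for each processing instruction

    def handle_pi(self, data):
        # getpos() reports the position where the instruction starts.
        line, column = self.getpos()
        self.found.append((line, column, data))


def find_processing_instructions(html_text):
    """Return a list of (line, column, text) tuples, empty if the page has none."""
    finder = ProcessingInstructionFinder()
    finder.feed(html_text)
    finder.close()
    return finder.found


def strip_processing_instructions(html_text):
    """Remove anything that looks like '<? ... >' from the page text.

    This is a plain text substitution, so it would also remove a matching
    sequence that appears inside a script or a comment.
    """
    return re.sub(r"<\?[^>]*>", "", html_text)


if __name__ == "__main__":
    sample = '<?xml version="1.0" encoding="utf-8"?>\n<html><body><p>Hello</p></body></html>'
    for line, column, text in find_processing_instructions(sample):
        print(f"processing instruction at line {line}, column {column}: {text}")
```

Running the finder over each file in the payload shows which pages, if any, carry such nodes. A cleaned copy of those pages could then be used for a test index to see whether the failure is tied to them.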

    See https://learn.microsoft.com/en-us/azure/search/cognitive-search-common-errors-warnings for some similar errors and warnings.

    I hope this is helpful! Do not hesitate to let me know if you have any other questions.

    Please don't forget to close the thread by upvoting this answer and accepting it if it is helpful.

