Please share with the community what you think needs improvement with Amazon Redshift.
What are its weaknesses? What would you like to see changed in a future version?
Amazon should provide more cloud-native tools that integrate with Redshift, similar to Microsoft's development tools for Azure.
Primary key enforcement should be improved, but there is probably a reason that they don't follow the reference architecture. It means they are creating clones of the data when sharding. Cost control measures could be improved, along with added transparency.
We recently moved from a DC2 cluster to an RA3 cluster, which uses a different node type, and we are finding some connection and processing issues with RA3. There is room for improvement in this area, and we are in talks with AWS regarding the connection issues. In an upcoming release, I would like to have a Snowflake-like feature where we can create another cluster in the same data warehouse, with the same data. You could then create separate clusters and compute nodes for each use case, such as retail or your data analysts, all while keeping the underlying data safe. Additionally, the cluster resize process takes down the cluster for too long, approximately 15 minutes. There are also limitations on size: you can resize only by a factor of two. For example, if you have four nodes, you can either go up to eight nodes or come down to two. There should be fewer limitations.
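The resize limitation described above can be sketched in a few lines. This is only an illustration of the rule as this reviewer states it (double or halve the node count); it is not the official AWS resize policy, and the function name `resize_options` is hypothetical.

```python
def resize_options(current_nodes: int) -> list[int]:
    """Return node counts reachable in one resize under the
    doubling/halving rule described by the reviewer (an assumption,
    not documented AWS behavior)."""
    if current_nodes < 1:
        raise ValueError("a cluster needs at least one node")
    options = []
    # Halving is only possible when the node count is even.
    if current_nodes % 2 == 0:
        options.append(current_nodes // 2)
    # Doubling is always possible under this simplified rule.
    options.append(current_nodes * 2)
    return options


print(resize_options(4))
```

Under this rule, a team that needs six nodes from a four-node cluster has to overshoot to eight, which is the kind of restriction the reviewer is asking to have relaxed.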
Improvement could be made in the area of streaming data. The capability can definitely be improved. There are other products like Kinesis, which is a separate service we use for streaming data ingestion. Whatever features are missing in Redshift are covered by separate services, but if it were feasible to ingest real-time streaming data directly into Redshift, that would be very useful.
Redshift is a multi-tier engine that works like a calculator. There is some missing functionality, and sometimes it is difficult to work in. We need to convert these functionalities using VACUUM inside Amazon Redshift, which adds some complexity. Sometimes I'd like for them to support some special features or some special installations because we need automatic populations. I would like to see more programming outside of the cloud. I would like to see more functionality for JSON files. The only functionality that they have now with JSON is reports. I would also like to see support for other data sources, such as MongoDB.
We have had some challenges with respect to designing a high-availability architecture for production. We don't find many issues now, but initially, we had some challenges. This is an older product, so when it comes to usability, it requires a technical person to work with it. It requires a specialist and a good business case to work on it. It has to be a little more user-friendly than it is today. In our experiments, the handling of unstructured data was not very smooth.
I would like a better way to ingest data in real time because there is a bit too much latency. There are too many limitations with respect to concurrency. It is now possible to auto-scale it, although that is still slow. It could offer smaller nodes that decouple storage and processing, because for the moment, the only nodes that work that way are huge and aimed at large companies.
The OLAP slice-and-dice features need to be improved. For example, if a business wants to bring in a general ledger from an ERP, they want to slice and dice the data. What we have found is that they have a lot of formulas that are used to calculate metrics, so what we do is use SQL Server Analysis Services. The question then becomes one of adopting a single vendor and transitioning to Azure. If Redshift had similar capabilities, it would be very good.
Performance when managing updates, deletes, and row-level changes is very low. For example, while you are doing inserts, updates, deletes, and aggregates, the performance is very poor. Once you have many terabytes of data, load and query performance is very low, and queries with aggregates are especially slow while the data is being updated. We like this solution and have tried all of the native services; they were working quite well. The only concern about Redshift was managing the cluster, especially the EMR cluster. Our company policy was not to use EMR clusters, especially with the nodes failing. There were many instances of downtime. Essentially, there was too much data traffic. The other drawback was CDC, as we do not have any tools that can support it. Creating the structure is easy on the DDL side, but after you create the table and want to transform the data to store it in the database, the performance is poor. It takes a lot of time to ingest and update the data, and once the data is ingested, fetching it from the table also takes a long time to return results.
It would be useful to have an option where all of the data can be queried at once and then have the result shown. As it is now, when we run a query and we are looking at the results, part of the data is still being processed at the back end. That works very well, but in some cases, we require the entire dataset to be queried before the results are shown. We have not faced many use cases where it would have been useful, but in one or two, we used other methods to achieve this goal. When our clients contact customer support, they don't want to speak with a machine. Instead, they want to chat with a real person who can provide a solution. Customer service bots can provide solutions, but they cannot understand our problems.
Pricing is one of the concerns that I have because if you compare Snowflake with Redshift, Snowflake provides some of the same services at a much cheaper rate. So pricing is one of the things that Redshift could improve. It should be more competitive. Otherwise, everything else looks good, especially the data storage and analytical processes.
From my perspective, the product could be improved by making it more flexible. There are now more flexible products on the market that allow for expandability and dynamic expansion as the data warehouse market changes. Although the product is simple to use, there can be problems. If you declare a unique key on a column and then store data in it, the database trusts that declaration without enforcing it, and results will be distorted if the data is not actually unique. It's fine if the query is simple, but if it's complex or you have too many queries per hour, it can create a bottleneck for Redshift, and then you can't return and recover. It requires some fine-tuning. For additional features, I would like to see support for partitions, which doesn't exist yet as a feature. It's quite an important issue when you're dealing with large databases. Also, I believe the product needs improvement in parallel threading to support more database users without jeopardizing performance.
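The unique-key problem mentioned above can be illustrated with a small, self-contained sketch. It does not talk to Redshift; it only shows how a count that trusts a declared-but-unenforced unique key diverges from a count that checks the data. The sample rows, column names, and function names are hypothetical.

```python
from collections import Counter

# Hypothetical rows; customer_id is declared unique, but nothing enforced it.
rows = [
    {"customer_id": 1, "amount": 100},
    {"customer_id": 2, "amount": 50},
    {"customer_id": 2, "amount": 50},  # duplicate that slipped in during a load
]


def count_trusting_constraint(rows):
    # If the key is assumed unique, the row count is taken as the distinct count.
    return len(rows)


def count_checked(rows, key):
    # Counts distinct key values without relying on the declaration.
    return len({row[key] for row in rows})


def duplicate_keys(rows, key):
    # Lists key values that appear more than once, as a pre-load sanity check.
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


print(count_trusting_constraint(rows), count_checked(rows, "customer_id"))
print(duplicate_keys(rows, "customer_id"))
```

A check like `duplicate_keys` run in the load pipeline is one way to keep a declared key honest when the warehouse itself will not reject duplicates.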
Running parallel queries results in poor performance and this needs to be improved.
The speed of the solution and its portability need improvement.
Compatibility with other vendors' products, for example, Microsoft and Google, is a bit difficult because each of them wants to keep its solutions isolated. That's a big problem now.
In the next release, a pivot function would be a big help. It could save a lot of time creating a query or process to handle operations.
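Until a pivot function is available, one common workaround is conditional aggregation, where each pivoted column becomes a `SUM(CASE WHEN ...)` expression. The sketch below generates such a query as a string. The helper `build_pivot_query` and the table and column names in the usage line are hypothetical, and identifiers are inserted as-is, so it should only be used with trusted names.

```python
def build_pivot_query(table, row_col, pivot_col, value_col, pivot_values):
    """Build a SQL pivot using conditional aggregation (a sketch;
    identifiers are not validated or quoted)."""
    cases = []
    for value in pivot_values:
        # Escape single quotes in the literal by doubling them.
        literal = str(value).replace("'", "''")
        # Derive a simple lowercase alias; assumes the value starts with a letter.
        alias = "".join(ch if ch.isalnum() else "_" for ch in str(value)).lower()
        cases.append(
            f"    SUM(CASE WHEN {pivot_col} = '{literal}' "
            f"THEN {value_col} ELSE 0 END) AS {alias}"
        )
    select_list = ",\n".join([f"    {row_col}"] + cases)
    return f"SELECT\n{select_list}\nFROM {table}\nGROUP BY {row_col};"


print(build_pivot_query("sales", "region", "quarter", "revenue", ["Q1", "Q2"]))
```

The list of pivot values has to be known up front, which is the manual step a built-in pivot function would save.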
Which is better and why?